Finding similar entities in BigData models

Ovidiu Petridean
Senior Software Developer @ SDL Research

PROGRAMMING

The purpose of the article is to present a way to find similar entities in BigData models. We will present an algorithm used to find similar sentences in a very large data corpus. Even though the examples presented are focused on finding similar sentences, this algorithm can be used to find any kind of entities that can be described by a set of characteristics.

The Problem

How do you find similar sentences in a very large data corpus (peta-bytes of data)? An important problem that arises when we search for similar items of any kind is that there may be far too many pairs of items to test each pair for their degree of similarity, even if computing the similarity of any one pair can be made very easy.

The solution

A fundamental data-mining problem is to examine data for "similar" items. So, by similar items we can refer to sentences or documents or something totally different. Eventually we will come to the conclusion that the problem of finding similar sentences can be generalized to a problem of finding similar items.

General knowledge

Data mining The most commonly accepted definition of "data mining" is the discovery of "models" for data. A "model," however, can be one of several things: statistical modeling, machine learning, summarization (e.g. Google's PageRank), feature extraction.

Hash Functions A hash function h takes a hash-key value as an argument and produces a bucket number as a result. The bucket number is an integer, normally in the range 0 to B − 1, where B is the number of buckets.

Similar Items To define similar items we will use the Jaccard Similarity. The Jaccard similarity of sets S and T is |S ∩ T |/|S ∪ T |, that is, the ratio of the size of the intersection of S and T to the size of their union. We shall denote the Jaccard similarity of S and T by SIM(S, T ).

Example:

Let's look at two sets S and T . There are three elements in their intersection and a total of eight elements that appear in S or T or both. Thus, SIM(S, T ) = 3/8.

Similarity of Documents (sentences) An important class of problems that Jaccard similarity addresses well is that of finding textually similar documents in a large corpus such as the Web or a collection of documents. Here are some examples: Plagiarism, Mirror Pages, Articles from the Same Source.

The algorithm for finding similar items

The algorithm has three main steps: "Shingling of Documents", "minhashing" and "locality-sensitive hashing". To present the algorithm we will use a practical example of finding similar sentences in a large data corpus.

1. Shingling of Documents (Sentences)

The most effective way to represent documents as sets, for the purpose of identifying lexically similar documents is to construct from the document the set of short strings that appear within it. In our particular use-case we have referred to words as shingles.

Hashing Shingles

Instead of using substrings directly as shingles, we can pick a hash function that maps strings of length k to some number of buckets and treat the resulting bucket number as the shingle.

Similarity-Preserving Summaries of Sets

Sets of shingles are large. Even if we hash them to four bytes each, the space needed to store a set is still roughly four times the space taken by the document. Our goal is to replace large sets by much smaller representations called "signatures." The important property we need for signatures is that we can compare the signatures of two sets and estimate the Jaccard similarity of the underlying sets from the signatures alone.

Matrix Representation of Sets

It is helpful to visualize a collection of sets as their characteristic matrix. The columns of the matrix correspond to the sets, and the rows correspond to elements of the universal set from which elements of the sets are drawn.

Each of the following examples only uses two sentences to show how to solve the similarity problem. However, the efficiency of the algorithm can be seen when we talk about a large data corpus. We use the algorithm to find similarities in peta-bytes of data

Let's assume the following sentences: S1: "I enjoyed my stay during summer at hotel California" S2: "I enjoyed my stay during winter at hotel Napoca"

We now have the following sets (using the words with ignore case): S1 = {i, enjoyed, my, stay, during, summer, at, hotel, california} S2 = {i, enjoyed, my, stay, during, winter, at, hotel, napoca}

It is important to remember that the characteristic matrix is unlikely to be the way the data is stored, but it is useful as a way to visualize the data. The similarity of these two sets is defined as the number of rows in which the sets agree divided by the total number of rows. So, in this case we conclude that the similarity is 7/11 which is 0.63. The problem is that as said before storing this sets will take a huge amount of memory (or disk space). Therefore we introduce the following technique called "mihashing".

2. Minhashing

The signatures we desire to construct for sets are composed of the results of a large number of calculations, say several hundred, each of which is a "minhash" of the characteristic matrix. To minhash a set represented by a column of the characteristic matrix, pick a permutation of the rows. The minhash value of any column is the number of the first row, in the permuted order, in which the column has a 1.

Example

Let's suppose we pick the following order in the previous matrix: This permutation defines a minhash function h that maps sets to rows. Let us compute the minhash value of set S1 according to h. The first column, which is the column for set S1, has 0 in the first row, so we proceed to row 2, the second in the permuted order. There is again a 0 in the column for S1, so we proceed to row 3, where we find a 1. Thus. h(S1) = "at" and h(S2) = "napoca"

Minhashing and Jaccard Similarity

There is a remarkable connection between minhashing and Jaccard similarity of the sets that are minhashed.

The probability that the minhash function for a random permutation of rows produces the same value for two sets equals the Jaccard similarity of those sets.

Minhash Signatures

Again, think of a collection of sets represented by their characteristic matrix M. To represent sets, we pick at random some number n of permutations of the rows of M.

Perhaps 100 permutations or several hundred permutations will do. Call the minhash functions determined by these permutations h1, h2, . . . , hn. From the column representing set S, construct the minhash signature for S, the vector

[h1(S), h2(S), . . . , hn(S)].

Computing Minhash Signatures

It is not feasible to permute a large characteristic matrix explicitly. Fortunately, it is possible to simulate the effect of a random permutation by a random hash function that maps row numbers to as many buckets as there are rows.

Thus, instead of picking n random permutations of rows, we pick n randomly chosen hash functions h1, h2, . . . , hn on the rows. We construct the signature matrix by considering each row in their given order. Let SIG(i, c) be the element of the signature matrix for the ith hash function and column c. Initially, set SIG(i, c) to ∞ for all i and c.

We handle row r by doing the following:

Compute h1(r), h2(r), . . . , hn(r).
For each column c do the following:

a. If c has 0 in row r, do nothing.

b. if c has 1 in row r, then for each i = 1, 2, . . . , n set SIG(i, c) to the smaller of the current value of SIG(i, c) and hi(r).

Let's look at the previous input matrix and compute the minhash signatures.

We pick the following two hash functions h(x) = x mod 11, g(x) = (2*x + 1) mod 11.

Now, let us simulate the algorithm for computing the signature matrix. We will not go through the whole process and just show the first iterations.

Because there are 11 rows, we will have to do 11 calculations of the functions h and g. Initially, this matrix consists of all ∞'s:

Step 1:

Looking at the input matrix we can see that for row 1 both S1 and S2 have 1's, therefore we will calculate the hash functions (h(x) and g(x), where x is the number of the row) for row 1.

So, h(1) = 1 and g(1) = 3, which means that both values are of course smaller than ∞ and so the previous table is changed to this:

Step 2:

We now move to row two where again we see that both S1 and S2 have 1's therefore we need to calculate the hash functions.

Now, h(2) = 2 and g(2) = 5 and because both these values are higher than what we have in the Signature matrix we will not change anything.

Steps 3 and 4 will not change anything and therefore we move to step 5.

Step 5:

As we can see both S1 and S2 have 1's in row 5 and so we compute h(5) and g(5). Because h(5) = 5 and g(5) = 0 we can see that we need to update the minimum value for g since this is smaller than what we currently have.

Steps 6 to 10 will not change anything and so we skip to step 11.

Step 11:

Here we can see that only S2 has a 1 in the 11th row which means that after computing the values for both h(11) and g(11) we will only update S2's values if there is the need to do so.

So, h(11) = 0 and g(11) = 1, which now leads to the final form of the signatures matrix.

Now looking at the previous process we can conclude that the two segments agree in one of the two rows from the signature matrix. This means that their similarity is estimated to 0.5, which is not a bad estimate considering that the Jaccard similarity is 0.63. Of course in order to get better results we can add more functions which lead to a better estimate. Often, we want to focus only on the most similar pairs or all pairs that are above somethreshold in similarity. There is a general theory of how to provide focus, called locality-sensitive hashing (LSH) or near-neighbor search.

3. Locality-sensitive hashing for Minhash Signatures

One general approach to LSH is to "hash" items several times, in such a way that similar items are more likely to be hashed to the same bucket than dissimilar items are. We then consider any pair that hashed to the same bucket for any of the hashings to be a candidate pair. We check only the candidate pairs for similarity.

If we have minhash signatures for the items, an effective way to choose the hashings is to divide the signature matrix into b bands consisting of r rowseach. For each band, there is a hash function that takes vectors of r integers (the portion of one column within that band) and hashes them to some large number of buckets.

Analysis of the Banding Technique

Suppose we use b bands of r rows each, and suppose that a particular pair of documents have Jaccard similarity s. We can calculate the probability that these sentences (or rather their signatures) become a candidate pair as follows:

The probability that the signatures agree in all rows of one particular band is s r . 2.The probability that the signatures disagree in at least one row of a particular band is 1-sr .
The probability that the signatures disagree in at least one row of each of the bands is(1-sr)b .
The probability that the signatures agree in all the rows of at least one band, and therefore become a candidate pair, is 1-(1-sr)b.

The threshold, that is, the value of similarity s at which the probability of becoming a candidate is 1/2, is a function of b and r. An approximation to the threshold is (1/b)1/r. For example, if b = 16 and r = 4, then the threshold is approximately at s = 1/2, since the 4th root of 1/16 is 1/2.

Example

We use the same two sentences but in order to make the example meaningful we use 10 hash functions to compute the signature matrix.

We set the similarity threshold to 0.5 which means that want to find out segments that have at least 0.5 Jaccard similarity.

We also know that we have 10 rows (no of functions). Therefore the combination of the threshold set to 0.5 and the number of functions set to 10 leads the conclusion that we will have 5 rows with two rows for each band.

Step 1 - Calculate the signature matrix

hn(x) = (n*x+1) mod 11, 0<=n<=10

We will now skip to the final step and the final signature matrix looks like this:

Step 2 - Calculate the polynomial functions for each band

Due to the fact that we have 5 bands we will need 5 polynomial functions that take two arguments. The arguments are the numbers that compose the part of the signature that is considered for each segment.

Let's first look at the five bands.

We will choose polynomial functions in the form a1 X1+a2X2+ ... +an*Xn mod m, where n is the number of functions and m is a very large prime.

For this example we will use the following functions:

p1 = 2*x1 + 3*x2 mod 13 p2 = 3*x1 + 5x2 mod 13 p3 = 5x1 + 7*x2 
mod 13 p4 = 7*x1 + 9*x2 mod 13 p5 = 9*x1 + 11*x2 mod 13

So, let's do the math.

(S1) p1 = 2*1 + 3*0 mod 13 = 2 (S2) p1 = 2*0 + 3*0 mod 13 = 0

Let's name the resulting buckets B1-2 and B1-0

Now we move to the second band

(S1) p2 = 3*0 + 5*2 mod 13 = 10 (S2) p2 = 3*0 + 5*0 mod 13 = 0

The resulting buckets are B2-10 and B2-0

For the third band we have the following:

(S1) p3 = 5*0 + 7*0 mod 13 = 0 (S2) p3 = 5*1 + 7*1 mod 13 = 12

The resulting buckets are B3-0 and B3-12

The fourth bucket

(S1) p4 = 7*0 + 9*0 mod 13 = 0 (S2) p4 = 7*0 + 9*0 mod 13 = 0

The interesting fact is that now we only have one bucket B4-0 since both the resulting values are 0.

The fifth bucket:

(S1) p5 = 9*0 + 11*0 mod 13 = 0 (S2) p5 = 9*0 + 11*0 mod 13 = 0

We now also have just one resulting bucket B5-0.

This means that we have the following buckets:

B1-2={S1} 
B1-0={S2} 
B2-10={S1} 
B2-0={S2} 
B3-0={S1} 
B3-12={S2} 
B4-0={S1,S2} 
B5-0={S1,S2}

Now since the two segments have hashed together in at least one bucket (actually B4-0 and B5-0) we will consider them as candidate pairs and calculate the Jaccard distance to find if they are indeed similar.

Conclusions

This algorithm is proven to be the most efficient when it works with a large data corpus.

In order to deal with a large amount of data, it is recommended to use a MapReduce implementation.

The use of this algorithm is not reduced to finding similar sentences. The algorithm can be used for any data set whose items can be individually described as a set of characteristics.

The larger we make the signatures, the higher is the chance to get less false positives and false negatives - but this also involves more computing time and space.

Check that candidate pairs really do have similar signatures by using Jaccard and don't rely only on this algorithm, since we know it produces false positives.