Computational tasks such as machine learning and natural language processing often depend on measuring how similar two items are. When you search the Web with Google or DuckDuckGo, you don’t just get pages that exactly match your search string. You also get ones that are, in some sense, similar to your search: they might contain synonyms or related words instead of the search terms.
Measuring similarity requires a distance metric between items that stand in some mathematical relationship to one another. The available approaches depend on what kind of relationship is involved.
- Multidimensional space applies to data points which have several independent numeric values. The points are considered part of a plane, a three-dimensional space, or a space with more dimensions, perhaps hundreds.
- Sets describe an item by enumerating its elements. Items are similar to the extent that their sets share the same elements.
- Graphs are structures with points and connecting paths that represent items; trees are one example. Items are similar if they’re close together when following paths through the graph.
N-dimensional spaces
If data points are described as locations in an n-dimensional space, there are several ways to measure the distance between them.
Euclidean distance treats the space as something analogous to physical space (disregarding relativity and space curvature). The distance between two points is the length of a vector between them. If there are two data items with values x1, y1, z1, … and x2, y2, z2 … then the distance is the square root of
(x2 – x1)^2 + (y2 – y1)^2 + (z2 – z1)^2 …
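As a minimal sketch, here’s the calculation in Python (the function name and sample points are just illustrative; the points form a 3-4-5 right triangle):

```python
import math

def euclidean_distance(a, b):
    # Square root of the sum of squared per-coordinate differences.
    return math.sqrt(sum((x2 - x1) ** 2 for x1, x2 in zip(a, b)))

print(euclidean_distance((0, 0), (3, 4)))  # 5.0
```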
Manhattan distance is appropriate for situations where there is no “straight line” between data points. Euclidean distance assumes that if one point is “north” and “east” of another, then you can get there by going “northeast.” For many types of data, though, there is no “northeast”; there is only the difference between each pair of variables. It’s like walking on the streets of Manhattan to get from one address to another; you have to go north and go east separately. This calculation is computationally cheaper than Euclidean distance, since it doesn’t require squares or square roots. It’s calculated as
|x2 – x1| + |y2 – y1| + |z2 – z1| …
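A sketch of the same idea in Python: the points that were a Euclidean distance of 5 apart are a Manhattan distance of 7 apart, since you have to cover the 3 and the 4 separately.

```python
def manhattan_distance(a, b):
    # Sum of absolute per-coordinate differences: no squares or roots needed.
    return sum(abs(x2 - x1) for x1, x2 in zip(a, b))

print(manhattan_distance((0, 0), (3, 4)))  # 7
```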
Chebyshev distance is even simpler. It’s the largest absolute value of any of the differences between the pairs of variables. It’s appropriate when all the values can change simultaneously and the limiting factor is the largest difference. The number of moves it takes a chess king to get from one square to another on a chessboard is an example of Chebyshev distance; it’s the greater of the horizontal and vertical differences between the squares.
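A sketch of the king example in Python, with files and ranks numbered 1 through 8 (the specific squares are just illustrative):

```python
def chebyshev_distance(a, b):
    # Largest absolute per-coordinate difference.
    return max(abs(x2 - x1) for x1, x2 in zip(a, b))

# A king on e1 (file 5, rank 1) reaching g5 (file 7, rank 5):
# 2 files and 4 ranks away, so 4 moves.
print(chebyshev_distance((5, 1), (7, 5)))  # 4
```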
All of these can be considered special cases of Minkowski distance, where the calculation is the pth root of the sum of the absolute differences, each raised to the power p. For Manhattan distance, p is 1, and for Euclidean distance, p is 2. Chebyshev distance is equal to the limit of Minkowski distance as p goes to infinity.
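Since the three distances differ only in the exponent, one sketch can show them all as special cases (the large-p call approximates the Chebyshev limit rather than computing it exactly):

```python
def minkowski_distance(a, b, p):
    # pth root of the sum of absolute differences raised to the power p.
    return sum(abs(x2 - x1) ** p for x1, x2 in zip(a, b)) ** (1 / p)

a, b = (0, 0), (3, 4)
print(minkowski_distance(a, b, 1))   # 7.0  (Manhattan)
print(minkowski_distance(a, b, 2))   # 5.0  (Euclidean)
print(minkowski_distance(a, b, 50))  # ~4.0 (approaching Chebyshev)
```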
Applying these methods of calculating similarity is straightforward; the harder problem is deciding what to use as variables. Raw data usually includes incommensurable numbers. Is five degrees of difference in annual temperature a greater difference or a smaller one than five inches of difference in rainfall? Before using any of these distance calculations, it’s necessary to transform the raw values into a set of variables such that each one represents a roughly equal amount of difference per unit. Sometimes there’s no clear way to do that, and other approaches to measuring similarity may work better.
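One common transformation, sketched below as an example rather than the only option, is to standardize each variable to zero mean and unit variance, so that one unit of any variable represents roughly the same amount of difference. The annual figures here are hypothetical:

```python
def standardize(values):
    # Rescale to zero mean and unit variance so each variable
    # contributes comparably to a distance calculation.
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / std for v in values]

# Hypothetical annual figures: degrees and inches are incommensurable...
temperatures = [55, 60, 65, 70, 75]
rainfall = [10, 20, 30, 40, 50]
# ...but their standardized values are directly comparable.
print(standardize(temperatures))
print(standardize(rainfall))
```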
Sets
A different approach to similarity treats items as sets of elements, rather than as points in a multidimensional space. For instance, someone comparing hotels might enumerate all the features of each one, such as having a restaurant, Wi-Fi, a swimming pool, free parking, etc. The more features two hotels have in common, the more similar they are.
A widely used way to measure this type of similarity is the Jaccard index. The Jaccard index for two items is the number of elements in the intersection of their sets divided by the number of elements in the union of their sets.
For example, suppose Hotel 1 has a restaurant, a swimming pool, and free parking, and Hotel 2 has a restaurant, Wi-Fi, and free parking. There are four distinct features in the union of the two sets, and two of them are shared, so the Jaccard index is 2/4 = 0.5, or 50%.
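Python’s built-in set operations make this a one-liner; the sketch below uses the hotel features from the example above:

```python
def jaccard_index(a, b):
    # Size of the intersection divided by size of the union.
    return len(a & b) / len(a | b)

hotel1 = {"restaurant", "swimming pool", "free parking"}
hotel2 = {"restaurant", "wi-fi", "free parking"}
print(jaccard_index(hotel1, hotel2))  # 0.5
```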
As with distances in multidimensional spaces, it’s necessary to arrange the data so that the differences make sense. The Jaccard index assumes that each element contributes equally to similarity. Suppose one of the hotels didn’t have beds in its rooms. That would be just one element of difference, but for most guests it would matter more than Wi-Fi, a restaurant, and free parking combined.
Graphs
In some cases, what matters most is the relationships or connections between data points. A graph (in the sense of a set of nodes with connecting paths) may then be the best representation. The similarity between two items depends on how “direct” a path there is from one to the other. This kind of similarity is a type of topological similarity, where connectedness is an important aspect of the measurement.
Path length is a straightforward example. The distance between two items is the number of connections that have to be traversed to get from one to the other. Degrees of separation and “friend of a friend” relations are examples of path length as a measure of distance.
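A breadth-first search finds the shortest path length in an unweighted graph. Here’s a sketch using a small, made-up friend network (the names and connections are hypothetical):

```python
from collections import deque

def path_length(graph, start, goal):
    # Breadth-first search: returns the number of edges on a shortest
    # path from start to goal, or None if they aren't connected.
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None

# A hypothetical friend-of-a-friend network.
friends = {"Ann": ["Bo"], "Bo": ["Ann", "Cy"],
           "Cy": ["Bo", "Di"], "Di": ["Cy"]}
print(path_length(friends, "Ann", "Di"))  # 3 degrees of separation
```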
The Wu-Palmer measure is a common way to quantify semantic similarity. It assumes a “subsuming” relationship, in which there are directed paths from a more general node to a more specific one. For example, “rodent” subsumes both “squirrel” and “mouse,” and “mammal” subsumes “rodent.” To calculate the similarity, let d_lcs be the depth of the “least common subsumer” (the deepest node that subsumes both items), d1 be the depth of the first item, and d2 be the depth of the second. Then the measure is
2 * d_lcs / (d1 + d2)
The depth of the root node is 1, so every depth is at least 1; the measure therefore never returns zero or causes a division by zero.
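Here’s a sketch for a taxonomy stored as child-to-parent links. The toy taxonomy extends the rodent example above; a production system would draw on a resource like WordNet instead.

```python
def depth(parents, node):
    # Count nodes from the root down to `node`; the root has depth 1.
    d = 1
    while node in parents:
        node = parents[node]
        d += 1
    return d

def wu_palmer(parents, a, b):
    # 2 * depth(LCS) / (depth(a) + depth(b)), where the LCS is the
    # deepest node that subsumes both a and b.
    ancestors_a, node = {a}, a
    while node in parents:
        node = parents[node]
        ancestors_a.add(node)
    lcs = b
    while lcs not in ancestors_a:
        lcs = parents[lcs]
    return 2 * depth(parents, lcs) / (depth(parents, a) + depth(parents, b))

# Toy taxonomy: each node maps to its parent; "animal" is the root.
parents = {"mammal": "animal", "rodent": "mammal",
           "squirrel": "rodent", "mouse": "rodent"}
print(wu_palmer(parents, "squirrel", "mouse"))  # 2*3 / (4+4) = 0.75
```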
Concluding thoughts
Algorithms to measure similarity are important tools in the development of machine learning and natural language processing. A developer who understands a variety of approaches is in a good position to pick the best one for a particular purpose.
However, blindly applying an algorithm to raw data won’t generally yield a meaningful measure of similarity. It’s necessary to consider various ways of transforming the variables, and some experimentation is often needed to find the best approach. What matters is how closely the measure approximates similarity as perceived in practice.