A Simple Introduction to Locality Sensitive Hashing (LSH)

Locality sensitive hashing is a super useful trick. Most people use it for near-neighbor search but it’s also helpful for sketching algorithms and high-dimensional data analysis.

This article will introduce the concept of Locality Sensitive Hashing (LSH) and the working principles of the algorithm.

LSH is generally used to find duplicate data in a data space
LSH is generally used to find duplicate data in a data space. Source: Mick

Introduction

Locality Sensitive Hashing (LSH) refers to a set of algorithmic techniques used to speed up the search for neighbours or duplicate data in the samples. LSH can be used to filter out duplicates in a database, scrape web pages or any websites at an impressive speed.

In computer science, LSH can be referred to as a technique that hashes similar input items into the same “buckets” with a high probability.

It differs from conventional hashing techniques in that hash collisions are maximised, not minimised. LSH technique can also be seen to reduce the dimensionality of high-dimensional data while preserving relative distances between items.

How Does LSH Work?

LSH functions are specifically designed so that hash value collisions are more likely for two input values that are close together than for inputs that are far apart. The term “close together” means that the data points that are within a certain distance bounded by a threshold value.

There are multiple hash functions for different use cases as there are multiple different LSH functions for different data types.

LSH algorithm can be implemented in languages such as Java and Python. Below are some of the libraries that can be used to implement LSH:

  1. java-LSH – This library implements Locality Sensitive Hashing (LSH), as described in “Mining of Massive Datasets”. Five different computations can be performed MinHash, Super-Bit, Comparable signatures, Initial seed and Serialization.
  2. LSH – This is a Python implementation of LSH using MinHash to detect near-duplicate text documents.

Hash functions map objects to numbers or bins. The LSH function L(x) tries to map similar objects to the same hash bin and dissimilar objects to different bins.

The key intuition behind LSH is that LSH functions try to group similar elements together into hash bins.

The difference between LSH table and Hash table
The difference between LSH table and Hash table

LSH Hash Collision

A hash collision occurs when two objects x and y have the same hash value. In our example above, all of the red dots collided, but this was not guaranteed to happen.

Under LSH, the term referring to how 2 data points are set to collide or have a chance to collide is “collision probability”. In a more mathematical notation,

To summarise the equation above, given a distance function d(x,y) and a threshold value of R, if the distance is less than the threshold value, there is a good chance (probability > p) of collisions happening.

Different LSH Function

Below explained are the many different LSH functions that are used for different use cases and data types.

Bit Sampling LSH

Bit sampling is one of the simplest and cheapest LSH functions. It is associated with the Hamming distance. Given two n-length vectors x and y, the Hamming distance d(x,y)  is the number of bits that are different between the two vectors.

Bit sampling is an LSH family where:

The collision probability is simply the chance that we picked one of the indices where xi=yi :

Euclidean and Manhattan LSH

The LSH functions for the Euclidean (L2) and Manhattan (L1) distances are based on random projections that separate the data space into multiple pieces.

From the random projections, multiple data pieces can be created
From the random projections, multiple data pieces can be created

Each piece then becomes a hash bin. The hash function is identical to another algorithm called signed random projections (not stated here), but we round instead of taking the sign of the projection.

This produces hash bins that repeat in the direction of w instead of a single decision boundary as with SRP.

r refers to a user-defined parameter that determines each hash bin’s width, b is a random number in the range [0, r]. The analytic expression for the collision probability is complicated to write down.

Clustering LSH

Clustering is most commonly used in LSH as a way to group similar data points together
Clustering is most commonly used in LSH as a way to group similar data points together

Clustering LSH is an example of a learned or data-dependent LSH function. The idea is to find a set of cluster centres that do a good job of approximating the data points in the datasets.

The individual centres might be found using k-means clustering, convex clustering, or any other clustering method. It is easy to show that this function is locality-sensitive but much harder to specify the collision probability since collisions depend on the set of learned cluster centres.

Hierarchical Clustering could also be used in certain use cases, and it is proven that it is much faster than the typical brute-force method.

Applications of LSH

Let us take a look at some of the commonly used applications of LSH.

  • Near Duplication Detection – LSH is commonly used to deduplicate large quantities of documents, webpages, and other files
  • Genome Study – LSH can be used to identify similar gene expressions in genome databases
  • Image and Video Search – Large-scale searching can be deployed using LSH
  • Video Fingerprinting – In multimedia technologies, LSH is widely used as a fingerprinting technique A/V data

There are two ways an LSH can speed things up: by helping you deal with a vast number of points or by helping you deal with points in a high-dimensional space such as image or video data.

One of the popular music streaming applications, Shazam, uses some advanced music information retrieval techniques to achieve this but one can implement such music recognition for fun using Audio Fingerprinting and Locality Sensitive Hashing.

Audio fingerprinting is the process of identifying unique characteristics from a fixed duration audio stream. Such unique characteristics can be identified for all existing songs and stored in a database.

Conclusion

Let us recap what has been introduced in the article. Locality Sensitive Hashing (LSH) refers to a set of algorithmic techniques used to speed up the search for neighbours or duplicate data in the samples.

LSH can be applicable to literally any kind of data and how much it’s relevant in today’s world of big data.

LSH functions are specifically designed so that hash value collisions are more likely for two input values close together than for inputs that are far apart. The fundamental intuition behind LSH is that LSH functions try to group similar elements into hash bins.

Are you looking for ways to get the best out of your data?

If yes, then let us help you use your data.

Your free special webinar guest invitation:
How to avoid the worst big data failures