Learnitweb

MinHash

Introduction to MinHash

MinHash, short for “Min-wise Independent Permutations Hashing”, is a probabilistic algorithm used to estimate the similarity between two sets. Unlike exact similarity measures, which require comparing every element in both sets, MinHash provides a fast and memory-efficient way to approximate similarity, making it extremely useful for large-scale datasets.

MinHash is primarily used to estimate Jaccard similarity, which measures the overlap between two sets:

Jaccard(A, B) = |A ∩ B| / |A ∪ B|
  • A ∩ B → Number of elements common to both sets.
  • A ∪ B → Total number of distinct elements across both sets.

Computing this exactly for large sets is time-consuming. MinHash allows us to estimate the Jaccard similarity without storing or comparing all elements, by using hash functions and sketches.

MinHash is widely used in:

  • Document similarity detection (e.g., finding near-duplicate web pages or plagiarism detection).
  • Search engines (to cluster similar web pages or detect spam).
  • Recommendation systems (to compare user behaviors or item similarities).
  • Big data analytics (where exact computation of similarities is expensive).

How MinHash Works

MinHash works by generating signatures (small sketches) for each set, so that the probability of matching signatures reflects the Jaccard similarity.

Step 1: Represent the Sets

  • Suppose we have two sets, A and B.
  • Example:
A = {apple, banana, cherry, date}
B = {banana, cherry, date, fig, grape}
  • Exact Jaccard similarity:
|A ∩ B| = 3   (banana, cherry, date)
|A ∪ B| = 7   (apple, banana, cherry, date, fig, grape)
J(A,B) = 3/7 ≈ 0.428
  • MinHash will estimate this efficiently using hash functions.

Step 2: Apply Multiple Hash Functions

  • Use k different hash functions (h1, h2, ..., hk) on the elements of the sets.
  • Each hash function maps an element to an integer.
  • For each set, compute the minimum hash value for each hash function.
  • This minimum value is called the MinHash signature for that hash function.

Example with h1:

h1(apple) = 15
h1(banana) = 10
h1(cherry) = 20
h1(date) = 5
Minimum for set A = 5

Repeat for all hash functions to generate a signature vector for each set.


Step 3: Construct MinHash Signatures

  • Suppose we use 3 hash functions (h1, h2, h3), then the signature vectors are:
Signature(A) = [min_h1(A), min_h2(A), min_h3(A)]
Signature(B) = [min_h1(B), min_h2(B), min_h3(B)]
  • Each entry is the minimum hash value of the elements in the set under that hash function.

Step 4: Estimate Jaccard Similarity

  • The similarity estimate is the fraction of hash functions for which the minimum hash values of both sets are equal:
Estimated J(A,B) = (# of matching minhashes) / (total # of hash functions)
  • This works because the probability that the minimum hash value is the same for two sets equals the Jaccard similarity.

Example:

  • Signature(A) = [5, 12, 7]
  • Signature(B) = [5, 10, 7]
  • Matching positions: 2 out of 3 → Estimated Jaccard ≈ 2/3 ≈ 0.667
  • Increasing the number of hash functions k improves accuracy.

Step 5: Locality Sensitive Hashing (Optional)

  • MinHash is often combined with Locality Sensitive Hashing (LSH) to efficiently find similar sets in large datasets.
  • LSH bins similar MinHash signatures together so that similar sets are more likely to be in the same bucket, reducing pairwise comparisons.

Example Workflow

Suppose we have three documents:

Doc1: {apple, banana, cherry}
Doc2: {banana, cherry, date}
Doc3: {apple, fig, grape}

Steps:

  1. Choose 5 hash functions (h1–h5).
  2. Compute minimum hash values for each document under each hash function.
  3. Create signature vectors:
Doc1_signature = [5, 10, 7, 12, 3]
Doc2_signature = [5, 11, 7, 15, 2]
Doc3_signature = [6, 20, 8, 18, 3]
  1. Estimate pairwise similarities:
  • Doc1 & Doc2: 2/5 matching → 0.4
  • Doc1 & Doc3: 1/5 matching → 0.2
  • Doc2 & Doc3: 0/5 matching → 0.0

This provides a quick similarity estimate without comparing every element.


Advantages of MinHash

  1. Memory Efficient
    • Only stores signature vectors instead of full sets.
    • Greatly reduces memory usage for large datasets.
  2. Fast Similarity Estimation
    • Avoids expensive exact set comparisons.
    • Useful for high-throughput systems like search engines.
  3. Probabilistic Guarantees
    • The expected value of MinHash similarity equals the true Jaccard similarity.
  4. Scalable
    • Works well for millions of documents or sets.
    • Can be combined with LSH for efficient nearest-neighbor search.
  5. Simple and Flexible
    • Easy to implement using standard hash functions.
    • Signature size (k) can be adjusted for desired accuracy.

Limitations

  1. Approximate
    • MinHash provides an estimate, not exact similarity.
    • Accuracy depends on number of hash functions.
  2. Requires Hash Functions
    • Needs multiple independent hash functions.
    • Poorly chosen hash functions can reduce accuracy.
  3. Best for Jaccard Similarity
    • MinHash directly estimates Jaccard similarity, not other metrics like Cosine or Dice similarity.
  4. Signature Storage
    • While small compared to full sets, storing signature vectors for millions of sets may still be significant.

Practical Applications

  1. Document Deduplication
    • Identify near-duplicate web pages, articles, or research papers.
    • Reduces storage and indexing overhead.
  2. Plagiarism Detection
    • Compare student assignments, code, or academic papers for similarity.
  3. Search Engines
    • Cluster similar web pages.
    • Reduce redundant search results.
  4. Recommendation Systems
    • Compare user behaviors or item attributes for similarity-based recommendations.
  5. Data Cleaning
    • Detect duplicates or near-duplicates in large databases.

Java Example Using MinHash

Libraries like MinHash LSH in Datasketches or stream-lib provide implementations. Here’s a conceptual example:

import org.apache.datasketches.minhash.*;

public class MinHashExample {
    public static void main(String[] args) {
        int numHashFunctions = 128;

        // Create MinHash instances for two sets
        MinHash mh1 = new MinHash(numHashFunctions);
        MinHash mh2 = new MinHash(numHashFunctions);

        // Add elements to sets
        String[] set1 = {"apple", "banana", "cherry"};
        String[] set2 = {"banana", "cherry", "date"};

        for (String s : set1) mh1.update(s.getBytes());
        for (String s : set2) mh2.update(s.getBytes());

        // Estimate Jaccard similarity
        double similarity = mh1.jaccard(mh2);
        System.out.println("Estimated Jaccard similarity: " + similarity);
    }
}

Explanation:

  • MinHash(numHashFunctions) initializes a MinHash object with k hash functions.
  • update() adds elements to the MinHash signature.
  • jaccard() estimates similarity between two signatures.

Summary

  • MinHash is a probabilistic algorithm for estimating Jaccard similarity between sets.
  • Uses hash functions to create compact signatures.
  • The fraction of matching minhash values approximates the similarity.
  • Advantages: memory-efficient, fast, scalable, and simple.
  • Limitations: approximate, requires multiple hash functions, only for Jaccard similarity.
  • Widely used in document deduplication, search engines, recommendation systems, and data cleaning.