Learnitweb

HyperLogLog in Java/Redis: Estimating Unique Counts Efficiently

In this lecture, we will explore HyperLogLog, a probabilistic data structure that is highly useful for counting unique elements in high-volume datasets without consuming a lot of memory.


1. Problem Scenario

Imagine you are running a high-traffic application with multiple microservices.

Some of the common business requirements might be:

  • Count unique IP addresses visiting your application.
  • Count unique users interacting with your products.
  • Count unique events per day, like product views or transactions.

For business decisions, exact numbers are not always necessary. What is often required is a close estimation of the unique count.


Why Not Use a Set?

The straightforward approach is to use a Set:

Set<String> uniqueUsers = new HashSet<>();
uniqueUsers.add(userId);
int uniqueCount = uniqueUsers.size();

Problems with sets at scale:

  • Each item is stored in memory.
  • For high-volume traffic (millions of unique users), memory usage grows significantly.
  • Adding new items requires checking for duplicates every time.

2. Introducing HyperLogLog

HyperLogLog is a probabilistic data structure designed for estimating cardinality (the number of unique elements).

Key characteristics:

  • Does not store actual items.
  • Provides an approximate unique count with high accuracy (~1% error rate).
  • Uses very little memory, even for extremely large datasets.

Example Use Case

Suppose you have 100,000 unique users:

  • Using a Set, you must store all 100,000 items in memory.
  • Using HyperLogLog, you store a compact representation, and it can still estimate the count as ~99,562. This is close enough for business analytics.

3. HyperLogLog in Java with Redis

Setup

  • Create a HyperLogLog object using Redisson or Redis commands.
  • Add elements to it as they arrive (like user IDs or IP addresses).
RHyperLogLog<Long> userVisits = redisson.getHyperLogLog("userVisits");

Adding Items

List<Long> userBatch = LongStream.rangeClosed(1, 25)
        .boxed()
        .collect(Collectors.toList());

userVisits.addAll(userBatch);
  • You can continue adding batches of elements multiple times.
  • HyperLogLog automatically updates its internal estimation without storing actual items.

Getting the Count

long estimatedUnique = userVisits.count();
System.out.println("Estimated unique count: " + estimatedUnique);
  • For small datasets, the estimate may be exact.
  • For large datasets, there may be a minor discrepancy (around 1% error).

4. Simulating High-Volume Data

Imagine multiple large lists of user IDs:

List<Long> list1 = LongStream.rangeClosed(1, 23000).boxed().toList();
List<Long> list2 = LongStream.rangeClosed(23001, 50000).boxed().toList();
List<Long> list3 = LongStream.rangeClosed(50001, 75000).boxed().toList();
List<Long> list4 = LongStream.rangeClosed(75001, 100000).boxed().toList();

Adding all lists to HyperLogLog:

userVisits.addAll(list1);
userVisits.addAll(list2);
userVisits.addAll(list3);
userVisits.addAll(list4);

Getting the estimated count:

long estimatedTotal = userVisits.count();
System.out.println("Estimated total unique users: " + estimatedTotal);

Example Output:

Estimated total unique users: 99,562
  • Actual unique users: 100,000
  • Difference: 438
  • This small error is acceptable for business-level analysis.

5. Advantages of HyperLogLog

  1. Memory Efficient
    • Stores only a compact representation of the unique items.
    • Memory usage is independent of the number of elements added.
  2. Fast
    • Adding items and counting is computationally inexpensive.
  3. Scalable
    • Can handle millions of unique items without growing memory footprint.
  4. Probabilistic Estimation
    • Accurate within ~1% error rate.

6. When to Use HyperLogLog

  • Counting unique visitors to a website.
  • Estimating unique IPs or sessions in analytics.
  • Aggregating unique events across high-volume microservices.
  • Scenarios where exact precision is not critical, but memory efficiency is important.

7. Summary

  • HyperLogLog is a probabilistic data structure that provides an approximate unique count.
  • Ideal for high-volume datasets where memory constraints make exact counting infeasible.
  • Easy to integrate with Redis via Redisson or Redis commands.
  • Provides a memory-efficient, fast, and scalable solution for unique count estimation.