HyperLogLog in Java/Redis: Estimating Unique Counts Efficiently

In this lecture, we will explore HyperLogLog, a probabilistic data structure that is highly useful for counting unique elements in high-volume datasets without consuming a lot of memory.

1. Problem Scenario

Imagine you are running a high-traffic application with multiple microservices.

Some of the common business requirements might be:

Count unique IP addresses visiting your application.
Count unique users interacting with your products.
Count unique events per day, like product views or transactions.

For business decisions, exact numbers are not always necessary. What is often required is a close estimation of the unique count.

Why Not Use a Set?

The straightforward approach is to use a Set:

Set<String> uniqueUsers = new HashSet<>();
uniqueUsers.add(userId);
int uniqueCount = uniqueUsers.size();

Problems with sets at scale:

Each item is stored in memory.
For high-volume traffic (millions of unique users), memory usage grows significantly.
Adding new items requires checking for duplicates every time.

2. Introducing HyperLogLog

HyperLogLog is a probabilistic data structure designed for estimating cardinality (the number of unique elements).

Key characteristics:

Does not store actual items.
Provides an approximate unique count with high accuracy (~1% error rate).
Uses very little memory, even for extremely large datasets.

Example Use Case

Suppose you have 100,000 unique users:

Using a Set, you must store all 100,000 items in memory.
Using HyperLogLog, you store a compact representation, and it can still estimate the count as ~99,562. This is close enough for business analytics.

3. HyperLogLog in Java with Redis

Setup

Create a HyperLogLog object using Redisson or Redis commands.
Add elements to it as they arrive (like user IDs or IP addresses).

RHyperLogLog<Long> userVisits = redisson.getHyperLogLog("userVisits");

Adding Items

List<Long> userBatch = LongStream.rangeClosed(1, 25)
        .boxed()
        .collect(Collectors.toList());

userVisits.addAll(userBatch);

You can continue adding batches of elements multiple times.
HyperLogLog automatically updates its internal estimation without storing actual items.

Getting the Count

long estimatedUnique = userVisits.count();
System.out.println("Estimated unique count: " + estimatedUnique);

For small datasets, the estimate may be exact.
For large datasets, there may be a minor discrepancy (around 1% error).

4. Simulating High-Volume Data

Imagine multiple large lists of user IDs:

List<Long> list1 = LongStream.rangeClosed(1, 23000).boxed().toList();
List<Long> list2 = LongStream.rangeClosed(23001, 50000).boxed().toList();
List<Long> list3 = LongStream.rangeClosed(50001, 75000).boxed().toList();
List<Long> list4 = LongStream.rangeClosed(75001, 100000).boxed().toList();

Adding all lists to HyperLogLog:

userVisits.addAll(list1);
userVisits.addAll(list2);
userVisits.addAll(list3);
userVisits.addAll(list4);

Getting the estimated count:

long estimatedTotal = userVisits.count();
System.out.println("Estimated total unique users: " + estimatedTotal);

Example Output:

Estimated total unique users: 99,562

Actual unique users: 100,000
Difference: 438
This small error is acceptable for business-level analysis.

5. Advantages of HyperLogLog

Memory Efficient
- Stores only a compact representation of the unique items.
- Memory usage is independent of the number of elements added.
Fast
- Adding items and counting is computationally inexpensive.
Scalable
- Can handle millions of unique items without growing memory footprint.
Probabilistic Estimation
- Accurate within ~1% error rate.

6. When to Use HyperLogLog

Counting unique visitors to a website.
Estimating unique IPs or sessions in analytics.
Aggregating unique events across high-volume microservices.
Scenarios where exact precision is not critical, but memory efficiency is important.

7. Summary

HyperLogLog is a probabilistic data structure that provides an approximate unique count.
Ideal for high-volume datasets where memory constraints make exact counting infeasible.
Easy to integrate with Redis via Redisson or Redis commands.
Provides a memory-efficient, fast, and scalable solution for unique count estimation.