In this lecture, we will explore HyperLogLog, a probabilistic data structure that is highly useful for counting unique elements in high-volume datasets without consuming a lot of memory.
1. Problem Scenario
Imagine you are running a high-traffic application with multiple microservices.
Some of the common business requirements might be:
- Count unique IP addresses visiting your application.
- Count unique users interacting with your products.
- Count unique events per day, like product views or transactions.
For business decisions, exact numbers are not always necessary. What is often required is a close estimation of the unique count.
Why Not Use a Set?
The straightforward approach is to use a Set:
Set<String> uniqueUsers = new HashSet<>(); uniqueUsers.add(userId); int uniqueCount = uniqueUsers.size();
Problems with sets at scale:
- Each item is stored in memory.
- For high-volume traffic (millions of unique users), memory usage grows significantly.
- Adding new items requires checking for duplicates every time.
2. Introducing HyperLogLog
HyperLogLog is a probabilistic data structure designed for estimating cardinality (the number of unique elements).
Key characteristics:
- Does not store actual items.
- Provides an approximate unique count with high accuracy (~1% error rate).
- Uses very little memory, even for extremely large datasets.
Example Use Case
Suppose you have 100,000 unique users:
- Using a
Set, you must store all 100,000 items in memory. - Using HyperLogLog, you store a compact representation, and it can still estimate the count as ~99,562. This is close enough for business analytics.
3. HyperLogLog in Java with Redis
Setup
- Create a HyperLogLog object using Redisson or Redis commands.
- Add elements to it as they arrive (like user IDs or IP addresses).
RHyperLogLog<Long> userVisits = redisson.getHyperLogLog("userVisits");
Adding Items
List<Long> userBatch = LongStream.rangeClosed(1, 25)
.boxed()
.collect(Collectors.toList());
userVisits.addAll(userBatch);
- You can continue adding batches of elements multiple times.
- HyperLogLog automatically updates its internal estimation without storing actual items.
Getting the Count
long estimatedUnique = userVisits.count();
System.out.println("Estimated unique count: " + estimatedUnique);
- For small datasets, the estimate may be exact.
- For large datasets, there may be a minor discrepancy (around 1% error).
4. Simulating High-Volume Data
Imagine multiple large lists of user IDs:
List<Long> list1 = LongStream.rangeClosed(1, 23000).boxed().toList(); List<Long> list2 = LongStream.rangeClosed(23001, 50000).boxed().toList(); List<Long> list3 = LongStream.rangeClosed(50001, 75000).boxed().toList(); List<Long> list4 = LongStream.rangeClosed(75001, 100000).boxed().toList();
Adding all lists to HyperLogLog:
userVisits.addAll(list1); userVisits.addAll(list2); userVisits.addAll(list3); userVisits.addAll(list4);
Getting the estimated count:
long estimatedTotal = userVisits.count();
System.out.println("Estimated total unique users: " + estimatedTotal);
Example Output:
Estimated total unique users: 99,562
- Actual unique users: 100,000
- Difference: 438
- This small error is acceptable for business-level analysis.
5. Advantages of HyperLogLog
- Memory Efficient
- Stores only a compact representation of the unique items.
- Memory usage is independent of the number of elements added.
- Fast
- Adding items and counting is computationally inexpensive.
- Scalable
- Can handle millions of unique items without growing memory footprint.
- Probabilistic Estimation
- Accurate within ~1% error rate.
6. When to Use HyperLogLog
- Counting unique visitors to a website.
- Estimating unique IPs or sessions in analytics.
- Aggregating unique events across high-volume microservices.
- Scenarios where exact precision is not critical, but memory efficiency is important.
7. Summary
- HyperLogLog is a probabilistic data structure that provides an approximate unique count.
- Ideal for high-volume datasets where memory constraints make exact counting infeasible.
- Easy to integrate with Redis via Redisson or Redis commands.
- Provides a memory-efficient, fast, and scalable solution for unique count estimation.
