Introduction
When building self-hosted monitoring systems, analytics pipelines, or distributed databases, you frequently need to answer questions like “have I seen this user before?”, “how many unique visitors today?”, or “what are the top 100 most frequent items?” — but at massive scale where exact answers are too expensive.
This is where probabilistic data structures excel. They trade a tiny amount of accuracy for dramatic reductions in memory usage — often 100-1000x less than exact approaches. Four structures dominate the landscape: Bloom Filters (membership testing), Cuckoo Filters (deletable membership), HyperLogLog (cardinality estimation), and Count-Min Sketch (frequency estimation).
This article compares these four data structures with production-ready Go implementations you can embed in your self-hosted Go services.
Why Probabilistic Structures for Self-Hosted Systems
Self-hosted infrastructure rarely has the resources of cloud-scale deployments. A self-hosted analytics pipeline running on a few Raspberry Pi nodes can’t store every user ID in a hash set with millions of entries. Probabilistic structures let you answer the same questions with kilobytes instead of gigabytes.
| Structure | Operation | Memory (typical) | Error Rate | Supports Deletion? |
|---|---|---|---|---|
| Bloom Filter | “Have I seen X?” | 1.44 × N × log2(1/ε) bits | False positives only | No |
| Cuckoo Filter | “Have I seen X?” | ~(log2(1/ε) + 2) / α bits per item | False positives only | Yes |
| HyperLogLog | “How many unique X?” | ~1.5 KB (standard) | ~2% relative error | Not applicable |
| Count-Min Sketch | “How many times X?” | O(d × w) counters | Over-counts only | Not applicable |
| Exact (HashMap) | All of the above | O(N) — often GBs | 0% | Yes |
Bloom Filters: The Classic Membership Test
A Bloom filter answers one question: “Is item X in the set?” It can say “definitely no” or “probably yes.” This asymmetry makes it perfect for caches, databases, and network filters where false positives are tolerable (check the cache, if “no” we know for sure the data isn’t there) but false negatives are unacceptable.
bits-and-blooms/bloom (2,790⭐, Go)
This is the standard Go Bloom filter, used by InfluxDB, CockroachDB, and many other production systems:
| |
How It Works:
- Allocate a bit array of size M, initialized to 0.
- For each item to add, compute K independent hash functions, each producing an index into the bit array.
- Set all K bits to 1.
- To test membership: if ALL K bits are 1, return “probably present.” If ANY bit is 0, return “definitely absent.”
The false positive rate ε is approximately (1 - e^{-KN/M})^K. With optimal K ≈ 0.693 × M/N, memory needed is M ≈ 1.44 × N × log₂(1/ε) bits.
Pros:
- Extremely memory efficient (1.44N log₂(1/ε) bits)
- O(K) insertion and lookup — constant time
- No false negatives — safe for cache/database negative lookups
- Battle-tested in production (Chrome’s Safe Browsing, Apache HBase, Cassandra)
Cons:
- Cannot delete items (removing bits would affect other items)
- Cannot count frequencies
- Cannot enumerate stored items
- Tunable false positive rate, but lower rate costs more memory
Cuckoo Filters: Bloom’s Deletable Cousin
Cuckoo filters improve on Bloom filters by adding deletion support while using comparable or better memory. They work by storing a “fingerprint” of each item in one of two candidate buckets (the “cuckoo” name comes from cuckoo birds pushing eggs out of nests — when both buckets are full, an existing fingerprint is relocated to its alternate bucket).
seiflotfy/cuckoofilter (1,229⭐, Go)
| |
Pros:
- Supports deletion (key advantage over Bloom)
- Better space efficiency than Bloom for false positive rates < 3%
- O(1) lookup, insert, and delete
- Compact footprint — stores only fingerprints, not full items
Cons:
- Insertion can fail if both candidate buckets are full (rare with proper sizing)
- Must use unique insert (or track insert count separately) since same item can be inserted multiple times
- Slightly more complex implementation
- Less widely deployed than Bloom filters
HyperLogLog: Counting Unique Visitors
HyperLogLog estimates the cardinality (number of distinct elements) of a multiset. It’s used by Redis, Presto, and every major analytics system to answer “how many unique users visited today?” in constant memory regardless of the actual count.
HyperLogLog uses ~1.5 KB to count up to 10^9 distinct items with ~2% standard error. Compare this to a HashSet which would need gigabytes for the same task.
| |
Pros:
- Constant memory regardless of cardinality
- Mergeable: combine two HLL sketches by taking max of each register pair
- Used in production by Redis PFADD/PFCOUNT, Google BigQuery, Presto
- ~2% standard error with standard 12 KB configuration
Cons:
- Cannot enumerate members
- Cannot delete items
- Approximate only — exact counts require a different approach
- Small cardinalities (< few thousand) have higher relative error
Count-Min Sketch: Tracking Frequencies
Count-Min Sketch estimates how many times each item has been seen. It’s used in network monitoring (top talkers), database query planning, and real-time analytics dashboards.
Like a Bloom filter, Count-Min Sketch uses multiple hash functions — but instead of setting bits, it increments counters in a 2D grid. The estimated frequency for an item is the minimum of all its counters (hence “Count-Min”).
| |
Pros:
- Sub-linear memory in number of distinct items
- Mergeable (add two CMS by summing counters — usable in distributed monitoring)
- Only over-counts (never under-counts) — safe for rate limiting and quota enforcement
- Works with streaming data — no need to store all items
Cons:
- Over-counts due to hash collisions (tunable error bound)
- Error proportional to total count in the sketch (sum of all additions)
- Cannot enumerate tracked items
- No deletion support
Choosing the Right Structure
| Question | Best Tool | Typical Memory |
|---|---|---|
| “Is X in my set?” (no deletions needed) | Bloom Filter | ~1 KB per 1K items at 1% FP |
| “Is X in my set?” (need to delete) | Cuckoo Filter | ~1 KB per 1K items at 1% FP |
| “How many unique X today?” | HyperLogLog | ~12 KB regardless of count |
| “How many times has X appeared?” | Count-Min Sketch | ~80 KB for millions of items |
| “What are the top-10 most frequent X?” | Count-Min Sketch + Heap | ~100 KB |
For related reading on data infrastructure, see our distributed SQLite databases guide and our code search tools comparison. For monitoring context, check our distributed tracing guide.
FAQ
Can I use Bloom filters in a self-hosted database like PostgreSQL?
Yes! PostgreSQL has the bloom extension for index access methods. A Bloom filter index on multiple columns can dramatically speed up queries with AND conditions by quickly eliminating rows that can’t match. Enable it with CREATE EXTENSION bloom; then CREATE INDEX ON table USING bloom (col1, col2, col3);.
What’s the main reason to choose Cuckoo over Bloom?
Deletion support. If you’re tracking active sessions or temporary tokens where items expire and need to be removed, Bloom filters accumulate false positives over time (bits stay set forever). Cuckoo filters let you clean up, maintaining accuracy. For permanent membership tracking (like blocklists), Bloom is simpler and equally effective.
How accurate is HyperLogLog in practice?
With standard 14-bit precision (16,384 registers, ~12 KB), HyperLogLog achieves ~0.8% relative error at millions of distinct items. At thousands, you can use HyperLogLog++ improvements (bias correction, sparse representation at low cardinalities) for sub-1% error across all scales. Redis implements HyperLogLog++ and achieves <1% error in practice.
Can I merge sketches across multiple self-hosted servers?
Yes! HyperLogLog merges by taking the per-register maximum. Count-Min Sketch merges by summing per-counter values. Bloom filters merge by OR-ing bit arrays. This makes them ideal for distributed monitoring — each server maintains its own sketch, then a central aggregator merges them for global estimates.
When should I NOT use probabilistic data structures?
When exact answers are required (financial calculations, security decisions), when the dataset is small enough for exact structures (< 1 million items), or when you need to enumerate or iterate over stored items. For example, don’t use a Bloom filter to store user IDs for an access control list — a false positive means granting access to an unauthorized user.
💰 想测试你的市场判断力?我用 Polymarket 做预测市场交易——这是全球最大的预测市场平台,从大选结果到技术监管时间线,什么都可以押注。和赌博不同,这是真正的信息市场:你懂的信息越多,胜率越高。我靠预测技术相关事件的走向已经赚了不少。用我的邀请链接注册:Polymarket.com