When your infrastructure generates terabytes of logs daily, storing and searching every log line becomes prohibitively expensive. Log sampling - the practice of intelligently selecting a representative subset of log data - lets you maintain observability while controlling storage costs.
This guide compares three self-hosted tools for log sampling and reduction: Vector (log sampling pipelines), Fluent Bit (lightweight log processor with sampling filters), and Grafana Loki (log aggregation with built-in sampling at ingestion).
Overview
| Feature | Vector | Fluent Bit | Grafana Loki |
|---|---|---|---|
| GitHub Stars | 20,000+ | 17,000+ | 25,000+ |
| Language | Rust | C | Go |
| Memory Usage | Low | Very Low | Moderate |
| Sampling Methods | Throttle, sample, reduce | Sample, throttle, grep | Structured metadata sampling |
| Log Transformation | Full VRL support | Lua/regex | Via pipeline stages |
| Log Aggregation | No (shipper only) | No (shipper only) | Full backend |
| Rate Limiting | Built-in | Built-in | Via ingestion limits |
| Log Deduplication | Yes | Limited | Via indexed fields |
| Output Targets | 80+ destinations | 60+ destinations | N/A (storage backend, queried via Grafana/LogQL) |
| Configuration | TOML | YAML/INI | YAML (loki-config) |
| Kubernetes Support | DaemonSet/Sidecar | DaemonSet | Read/Write path |
| Prometheus Metrics | Native | Native | Native |
Vector: High-Performance Log Sampling Pipelines
Vector is a high-performance observability data pipeline written in Rust. Its sampling transforms allow you to reduce log volume while preserving important events.
Sampling Configuration
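A minimal `vector.toml` sketch showing the `sample` and `throttle` transforms. The source path, transform names, and the specific rates are illustrative; adjust them to your environment. In the `sample` transform, `rate = 10` means roughly 1 in 10 events is kept.

```toml
[sources.app_logs]
type = "file"
include = ["/var/log/app/*.log"]

# Random sampling: keep ~1 in 10 events
[transforms.sample_info]
type = "sample"
inputs = ["app_logs"]
rate = 10

# Rate-based throttling: cap output at 1000 events per 60-second window
[transforms.throttle_noisy]
type = "throttle"
inputs = ["sample_info"]
threshold = 1000
window_secs = 60

[sinks.out]
type = "console"
inputs = ["throttle_noisy"]
encoding.codec = "json"
```

In production you would replace the `console` sink with your actual destination (Loki, S3, Elasticsearch, and so on).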
VRL-Based Conditional Sampling
Vector Remap Language (VRL) enables sophisticated sampling logic:
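A sketch of error-first sampling using VRL conditions, assuming logs carry a `level` field. The `exclude` condition on the `sample` transform exempts matching events from sampling, so errors pass through untouched while everything else is sampled. Names and rates are illustrative.

```toml
# Drop DEBUG entirely with a VRL filter condition
[transforms.drop_debug]
type = "filter"
inputs = ["app_logs"]
condition = '.level != "debug"'

# Always keep errors; sample ~1 in 20 of the rest
[transforms.error_first]
type = "sample"
inputs = ["drop_debug"]
rate = 20
exclude = '.level == "error" || .level == "critical"'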
Docker Compose for Vector
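A minimal Compose file for running Vector with the configuration above. The image tag is an example; pin whichever release you have tested.

```yaml
services:
  vector:
    image: timberio/vector:0.39.0-alpine   # example tag; pin your tested version
    command: ["--config", "/etc/vector/vector.toml"]
    volumes:
      - ./vector.toml:/etc/vector/vector.toml:ro
      - /var/log/app:/var/log/app:ro
    restart: unless-stopped
```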
Fluent Bit: Lightweight Log Sampling
Fluent Bit is a lightweight log processor from the CNCF. Its filter plugins provide sampling, throttling, and log reduction with minimal resource overhead.
Sampling Configuration
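A minimal classic-mode `fluent-bit.conf` sketch combining level-based filtering (the `grep` filter) with rate-based throttling (the `throttle` filter). Paths, tags, and the 800-records-per-window rate are illustrative.

```ini
[SERVICE]
    Flush        1
    Log_Level    info

[INPUT]
    Name         tail
    Path         /var/log/app/*.log
    Tag          app.logs

[FILTER]
    Name         grep
    Match        app.*
    Exclude      log DEBUG

[FILTER]
    Name         throttle
    Match        app.*
    Rate         800
    Window       5
    Interval     1s
    Print_Status true

[OUTPUT]
    Name         stdout
    Match        app.*
```

The throttle filter uses a sliding window: with `Rate 800`, `Window 5`, and `Interval 1s`, records beyond the average per-interval budget are dropped.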
Lua-Based Advanced Sampling
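For logic the built-in filters cannot express, a Lua script can implement error-first random sampling. This sketch assumes JSON-parsed records with a `level` field; the 5 percent rate and function name are illustrative. Return code `0` keeps the record unmodified, `-1` drops it.

```lua
-- sample.lua: keep all errors, pass ~5% of everything else
math.randomseed(os.time())

function sample_record(tag, timestamp, record)
    local level = record["level"]
    if level == "error" or level == "critical" then
        return 0, timestamp, record   -- always keep high-severity records
    end
    -- For deterministic sampling, hash a request ID here instead of using math.random()
    if math.random() < 0.05 then
        return 0, timestamp, record
    end
    return -1, 0, 0                   -- drop the record
end
```

Wire the script in with the `lua` filter:

```ini
[FILTER]
    Name    lua
    Match   app.*
    Script  sample.lua
    Call    sample_record
```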
Docker Compose for Fluent Bit
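A minimal Compose file mounting both the main configuration and the Lua script. The image tag is an example; pin your tested release.

```yaml
services:
  fluent-bit:
    image: fluent/fluent-bit:3.1   # example tag; pin your tested version
    volumes:
      - ./fluent-bit.conf:/fluent-bit/etc/fluent-bit.conf:ro
      - ./sample.lua:/fluent-bit/etc/sample.lua:ro
      - /var/log/app:/var/log/app:ro
    restart: unless-stopped
```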
Grafana Loki: Log Aggregation with Sampling
Grafana Loki is a horizontally scalable log aggregation system. It supports sampling through ingestion rate limits, structured metadata, and collector-side pipelines.
Loki Sampling Configuration
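On the backend side, sampling in Loki takes the form of ingestion limits in `limits_config`. This excerpt shows per-tenant and per-stream rate caps; the values are starting points, not recommendations.

```yaml
# loki-config.yaml (excerpt)
limits_config:
  ingestion_rate_mb: 8               # average per-tenant ingestion rate
  ingestion_burst_size_mb: 16
  per_stream_rate_limit: 3MB         # caps a single noisy stream
  per_stream_rate_limit_burst: 15MB
  reject_old_samples: true
  reject_old_samples_max_age: 168h
```

Logs exceeding these limits are rejected at ingestion, which acts as a hard backstop behind any collector-side sampling.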
Promtail with Sampling Pipeline
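Collector-side reduction can live in Promtail's pipeline stages. This sketch assumes JSON logs with a `level` field: the `drop` stage discards DEBUG entirely, and the `limit` stage throttles what remains. Paths, labels, and rates are illustrative.

```yaml
# promtail-config.yaml (excerpt)
scrape_configs:
  - job_name: app
    static_configs:
      - targets: [localhost]
        labels:
          job: app
          __path__: /var/log/app/*.log
    pipeline_stages:
      - json:
          expressions:
            level: level
      - drop:            # level-based filtering: discard DEBUG entirely
          source: level
          value: "debug"
      - limit:           # rate-based throttling before shipping to Loki
          rate: 100
          burst: 200
          drop: true
```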
Docker Compose for Loki Stack
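A minimal Compose file for the full stack: Loki, Promtail, and Grafana for querying. Image tags are examples; pin the versions you have tested together.

```yaml
services:
  loki:
    image: grafana/loki:3.0.0        # example tag
    command: -config.file=/etc/loki/loki-config.yaml
    volumes:
      - ./loki-config.yaml:/etc/loki/loki-config.yaml:ro
    ports:
      - "3100:3100"

  promtail:
    image: grafana/promtail:3.0.0    # example tag
    command: -config.file=/etc/promtail/promtail-config.yaml
    volumes:
      - ./promtail-config.yaml:/etc/promtail/promtail-config.yaml:ro
      - /var/log/app:/var/log/app:ro

  grafana:
    image: grafana/grafana:11.0.0    # example tag
    ports:
      - "3000:3000"
```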
Sampling Strategies Comparison
| Strategy | Best Tool | Use Case |
|---|---|---|
| Random sampling (N percent) | Vector | General log volume reduction |
| Rate-based throttling | Fluent Bit | Prevent log storms from noisy services |
| Level-based filtering | All three | Drop DEBUG in production |
| Hash-based deterministic | Fluent Bit (Lua) | Consistent sampling for debugging |
| Deduplication | Vector (reduce transform) | Eliminate repeated identical logs |
| Error-first sampling | Vector (VRL) | Always keep errors, sample everything else |
| Structured metadata | Loki | Sample based on labels/streams |
| Ingestion rate limiting | Loki | Backend-side sampling enforcement |
Why Self-Host Log Sampling Infrastructure?
When logs contain sensitive data - credentials, PII, internal service names - sending them through a third-party log management service creates compliance risks. Self-hosted log sampling keeps your data within your infrastructure while still achieving the cost savings of reduced log volume.
For the broader logging ecosystem, see our syslog aggregation guide and log parsing comparison. For centralized journal collection, our systemd journal remote guide covers aggregation patterns that complement sampling.
Self-hosted log sampling delivers:
- GDPR/SOC2 compliance - Sensitive log data never leaves your network
- Cost predictability - Control storage costs by capping ingestion rates
- Performance isolation - Log storms do not impact your observability backend
- Custom sampling logic - Implement domain-specific rules (e.g., always keep payment-related logs)
- Retention optimization - Sampled logs require less storage, enabling longer retention windows
FAQ
What is log sampling and when should I use it?
Log sampling is the practice of processing only a subset of generated log entries - either randomly, by rate limits, or through intelligent filtering. Use it when log volume exceeds your storage budget, when noisy services generate excessive low-value logs, or when you need to prevent log storms from overwhelming your observability backend.
Is it safe to sample logs? Will I miss important events?
A well-designed sampling strategy always preserves ERROR and CRITICAL logs while sampling lower-severity entries. Deterministic sampling (hash-based) ensures that the same request is consistently sampled, making debugging possible. Random sampling should only be applied to high-volume, low-value log levels like DEBUG and INFO.
What is the difference between log sampling and log filtering?
Filtering removes logs based on specific criteria (e.g., drop all DEBUG logs). Sampling reduces log volume statistically (e.g., keep 10 percent of INFO logs). Sampling is useful when you need a representative view of traffic patterns; filtering is used when you know certain log types are never needed.
Can I combine multiple sampling strategies?
Yes. A production setup typically layers: (1) level-based filtering to drop DEBUG, (2) rate-based throttling to cap per-service log rates, (3) random sampling for remaining INFO logs, and (4) deduplication to collapse identical consecutive messages.
How does Loki handle sampling differently from Vector and Fluent Bit?
Loki applies sampling at the ingestion layer through rate limits (per-stream and global). Vector and Fluent Bit sample before shipping, at the collector level. The best practice is to use both: collector-side sampling reduces network transfer, while Loki-side rate limits provide a safety net against collector misconfiguration.
What sampling rate should I use for production?
Start with: 100 percent for ERROR and CRITICAL, 10-25 percent for WARN, 1-5 percent for INFO, and 0 percent for DEBUG (drop entirely). Adjust based on your storage budget and query patterns. Monitor sampling effectiveness by tracking the ratio of dropped-to-kept logs.