Introduction
Understanding CPU performance at the hardware level is essential for optimizing self-hosted infrastructure. Modern x86 and ARM processors include a Performance Monitoring Unit (PMU): a small set of hardware counter registers that can be programmed to count any of hundreds of events, such as cache misses, branch mispredictions, retired instructions, and memory traffic. Three open-source tools dominate the Linux performance counter landscape: the kernel’s built-in perf subsystem, the HPC-grade LIKWID (Like I Knew What I’m Doing) suite, and the Intel-focused pmu-tools collection.
This guide compares all three across setup complexity, metric coverage, visualization capabilities, and integration with self-hosted monitoring stacks. Whether you’re debugging a database performance regression or benchmarking a new server, the right tool choice dramatically reduces time to insight.
Comparison Table
| Feature | perf | LIKWID | pmu-tools |
|---|---|---|---|
| Installation | Kernel subsystem; CLI from distro package | apt install likwid | Git clone + Python |
| Stars | Kernel: 235K+ | 1,907 | 2,229 |
| Last Update | Continuous | June 2026 | April 2026 |
| Events Exposed | 200+ per CPU | 200+ per CPU | 200+ per CPU |
| Sampling Support | Yes (perf record) | Limited | Via perf underneath |
| Top-Down Analysis | Yes (perf stat) | Yes (likwid-perfctr) | Yes (toplev) |
| Uncore/RAPL | Basic | Extensive | Extensive |
| GUI/Web UI | Via perf.data | likwid-web | Grafana via toplev |
| Container Support | Full | Requires host access | Via perf events |
| MPI/HPC Aware | No | Yes (native) | No |
| Learning Curve | Moderate | Steep | Moderate |
perf: The Kernel’s Swiss Army Knife
The perf subsystem ships with every Linux kernel and provides the broadest compatibility across CPU architectures. It operates through the perf_event_open() syscall and exposes counters, tracepoints, kprobes, and uprobes from a single CLI.
Installation & Basic Usage
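The perf subsystem lives in the kernel, but the perf command itself usually comes from a distribution package. Examples are linux-tools-$(uname -r) on Ubuntu, linux-perf on Debian, and perf on Fedora. Its basic counting mode, perf stat, can emit machine-readable CSV with -x.

The sketch below runs perf stat and turns its output into a dictionary. It assumes Python 3.9+ and the default non-interval CSV layout: value, unit, event name, run time, percentage, then optional metric columns. The helper names are illustrative and not part of perf.

```python
import subprocess
from typing import Optional

def run_perf_stat(cmd: list[str], events: str = "cycles,instructions") -> str:
    """Run `perf stat` in CSV mode and return its report.

    perf stat writes counts to stderr, so the workload's own stderr is mixed in.
    """
    proc = subprocess.run(
        ["perf", "stat", "-x", ",", "-e", events, "--", *cmd],
        capture_output=True,
        text=True,
    )
    return proc.stderr

def parse_perf_stat_csv(text: str) -> dict[str, Optional[float]]:
    """Map event name -> count from `perf stat -x,` output.

    Assumed field order: value, unit, event, run time, percent, [metric columns].
    Values such as <not counted> or <not supported> become None.
    """
    counts: dict[str, Optional[float]] = {}
    for line in text.splitlines():
        fields = line.split(",")
        if line.startswith("#") or len(fields) < 3:
            continue  # skip comments, blank lines and non-counter rows
        value, event = fields[0], fields[2]
        try:
            counts[event] = float(value)
        except ValueError:
            counts[event] = None
    return counts
```

Usage: `counts = parse_perf_stat_csv(run_perf_stat(["./app"]))`, after which instructions per cycle is `counts["instructions"] / counts["cycles"]`.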
Sampling with perf record
For production analysis, sampling mode (perf record) captures call stacks with minimal overhead, typically 1-3% at a modest sampling frequency such as 99 Hz.
Top-Down Analysis (Intel, plus AMD Zen 4 and newer on recent perf)
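Intel’s Top-Down Microarchitecture Analysis method attributes the core’s pipeline slots to four top-level categories (Frontend Bound, Bad Speculation, Retiring, and Backend Bound), then drills into whichever dominates. On supported CPUs, perf stat --topdown prints the top-level breakdown.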
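To make the four categories concrete, the sketch below applies the classic Level 1 formulas published for 4-wide Intel cores before Ice Lake. It uses Skylake-era event names passed in as plain numbers.

Newer cores (Ice Lake onward) have dedicated top-down metric support. perf and toplev also use model-specific recipes. Treat this as an illustration of the arithmetic rather than what either tool computes on your CPU.

```python
def topdown_level1(cycles: float,
                   idq_uops_not_delivered_core: float,
                   uops_issued_any: float,
                   uops_retired_retire_slots: float,
                   int_misc_recovery_cycles: float,
                   width: int = 4) -> dict[str, float]:
    """Classic Level 1 Top-Down breakdown for a `width`-wide Intel core.

    Arguments are raw counts of CPU_CLK_UNHALTED.THREAD, IDQ_UOPS_NOT_DELIVERED.CORE,
    UOPS_ISSUED.ANY, UOPS_RETIRED.RETIRE_SLOTS and INT_MISC.RECOVERY_CYCLES.
    """
    slots = width * cycles  # total issue slots = cycles x pipeline width
    if slots <= 0:
        raise ValueError("cycles must be positive")
    # Slots the front end failed to fill with uops.
    frontend = idq_uops_not_delivered_core / slots
    # Uops issued but never retired, plus slots lost while recovering from mispredicts.
    bad_spec = (uops_issued_any - uops_retired_retire_slots
                + width * int_misc_recovery_cycles) / slots
    # Slots that retired useful uops.
    retiring = uops_retired_retire_slots / slots
    # Everything left over is attributed to the back end.
    backend = 1.0 - (frontend + bad_spec + retiring)
    return {
        "frontend_bound": frontend,
        "bad_speculation": bad_spec,
        "retiring": retiring,
        "backend_bound": backend,
    }
```

Because Backend Bound is defined as the remainder, the four fractions always sum to 1.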
LIKWID: HPC-Grade Precision
LIKWID is built for reproducible counter measurements, which matters for benchmarks. Its core differentiator is pinning: it binds the measured threads to specific CPU cores, then programs and reads the counters on exactly those cores. As a result, thread migration by the scheduler does not blur the results.
Docker Compose Setup
Topology Discovery
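likwid-topology reports sockets, cores, hardware threads and caches. Part of that picture is exposed by Linux in sysfs. For example, /sys/devices/system/cpu/cpuN/topology/thread_siblings_list lists the logical CPUs that share a physical core, in the kernel's cpulist format (such as 0,4 or 0-3).

The sketch below reproduces only that core grouping, not LIKWID's cache or NUMA output. The function names are hypothetical.

```python
import glob
import os

def parse_cpulist(text: str) -> list[int]:
    """Expand a kernel cpulist such as '0-3,8-11' into [0, 1, 2, 3, 8, 9, 10, 11]."""
    cpus: list[int] = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-", 1)
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus

def group_threads(sibling_lists: list[str]) -> list[tuple[int, ...]]:
    """Deduplicate per-CPU sibling lists into one tuple of logical CPUs per physical core."""
    return sorted({tuple(parse_cpulist(s)) for s in sibling_lists})

def physical_cores(root: str = "/sys/devices/system/cpu") -> list[tuple[int, ...]]:
    """Read thread_siblings_list for every logical CPU under sysfs and group them."""
    pattern = os.path.join(root, "cpu[0-9]*", "topology", "thread_siblings_list")
    lists = []
    for path in glob.glob(pattern):
        with open(path) as f:
            lists.append(f.read())
    return group_threads(lists)
```

On a host with two-way SMT, `physical_cores()` returns pairs such as `(0, 4)`. Those pairs are worth knowing before pinning, so that two measured threads do not end up sharing one physical core.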
Precision Measurements
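LIKWID’s standout tool is likwid-perfctr. It pins the workload to the cores given with -C and reports raw counts plus derived metrics from a predefined performance group selected with -g.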
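Pinning of this kind relies on the operating system's CPU affinity mechanism, which on Linux is sched_setaffinity. The sketch below uses that mechanism from Python through os.sched_setaffinity so a benchmark function always runs on the same core.

It does not program or read any counters, and run_pinned is a hypothetical helper. On the command line, the LIKWID equivalent would be along the lines of likwid-perfctr -C 2 -g MEM ./app.

```python
import os
from typing import Callable, TypeVar

T = TypeVar("T")

def run_pinned(core: int, fn: Callable[..., T], *args) -> T:
    """Pin the calling thread to one logical CPU, run fn, then restore the old mask (Linux only)."""
    original = os.sched_getaffinity(0)  # pid 0 means the caller
    os.sched_setaffinity(0, {core})
    try:
        return fn(*args)
    finally:
        os.sched_setaffinity(0, original)
```

Restoring the original mask in `finally` keeps later code unpinned even if the benchmark raises.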
LIKWID Marker API (C/C++)
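LIKWID's Marker API is a C interface:
- code regions are bracketed with LIKWID_MARKER_START("name") and LIKWID_MARKER_STOP("name") between LIKWID_MARKER_INIT and LIKWID_MARKER_CLOSE;
- the program is compiled with -DLIKWID_PERFMON and linked against liblikwid;
- likwid-perfctr -m then reports counters per region.

The Python sketch below is not LIKWID's API. Markers and region are hypothetical names that record only call counts and wall-clock time per region. The point is to show how repeated start/stop pairs with the same name accumulate into one result.

```python
import time
from collections import defaultdict
from contextlib import contextmanager
from typing import Iterator

class Markers:
    """Hypothetical region bookkeeping: per-region call count and wall time."""

    def __init__(self) -> None:
        self.calls = defaultdict(int)
        self.seconds = defaultdict(float)

    @contextmanager
    def region(self, name: str) -> Iterator[None]:
        start = time.perf_counter()  # plays the role of a START marker
        try:
            yield
        finally:
            # Plays the role of a STOP marker; repeated entries accumulate.
            self.seconds[name] += time.perf_counter() - start
            self.calls[name] += 1

    def report(self) -> None:
        for name in sorted(self.calls):
            print(f"{name}: {self.calls[name]} calls, {self.seconds[name]:.6f} s")
```

Usage: wrap the hot loop in `with markers.region("solver"):` and call `markers.report()` at exit. With LIKWID, the per-region numbers would be hardware counts rather than seconds.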
pmu-tools: Intel Deep Dive
Andy Kleen’s pmu-tools bridges perf and Intel-specific PMU features. Its standout tool is toplev — a pipeline bottleneck analyzer that maps hundreds of PMU events to high-level performance categories.
Installation
toplev: Pipeline Bottleneck Analysis
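toplev walks the Top-Down hierarchy to the depth you request, for example toplev.py -l2 ./app for two levels, and highlights the likely bottleneck. The sketch below mimics only the drill-down idea: at each level, follow the node with the largest share of pipeline slots while it stays above a threshold.

Node, bottleneck_path and the fractions in the test data are hypothetical. The sketch also naively treats every node alike, whereas a high Retiring share is usually a healthy sign rather than a bottleneck.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    fraction: float  # share of pipeline slots, 0..1
    children: list["Node"] = field(default_factory=list)

def bottleneck_path(level: list[Node], threshold: float = 0.2) -> list[str]:
    """Follow the largest node at each level while it stays above `threshold`."""
    path: list[str] = []
    while level:
        top = max(level, key=lambda n: n.fraction)
        if top.fraction < threshold:
            break  # nothing at this level is significant enough to drill into
        path.append(top.name)
        level = top.children
    return path
```

For a tree where Backend_Bound dominates and Memory_Bound dominates beneath it, the result reads like a toplev drill-down, for example `["Backend_Bound", "Memory_Bound", "DRAM_Bound"]`.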
OCD: Optimized Call-graph Decoding
Integration with Self-Hosted Monitoring
Choosing the Right Tool
| Use Case | Best Tool | Why |
|---|---|---|
| Quick CPU overview on any Linux server | perf | Packaged by every major distro, universal compatibility |
| HPC cluster benchmarking | LIKWID | Per-core pinning, MPI-aware, reproducible |
| Intel microarchitecture optimization | pmu-tools (toplev) | Top-down analysis, event-level detail |
| Container-native monitoring | perf | Works inside containers with appropriate permissions |
| Memory bandwidth analysis | LIKWID | MEM performance group reports DRAM bandwidth and data volume |
| Production flame graphs | perf | perf record + perf script pipeline |
| Long-term trend monitoring | pmu-tools + Grafana | toplev CSV output feeds dashboards |
Why Self-Host Your Performance Monitoring?
Running performance monitoring on your own infrastructure provides several critical advantages. First, you own the data: CPU telemetry never leaves your network, which is essential for security-conscious deployments in finance, healthcare, and defense. Second, self-hosted tools can be tuned to your specific hardware mix instead of relying on cloud vendor abstractions that often hide PMU-level detail.
Cost control is another factor. Cloud monitoring services charge per metric and per gigabyte of ingestion — a single server generating 200+ PMU counters at 1-second intervals can cost hundreds of dollars per month. Self-hosted perf, LIKWID, and pmu-tools generate the same data for free. You only pay for the Grafana instance visualizing it.
For HPC environments running MPI jobs across hundreds of nodes, LIKWID’s per-core pinning and cluster-wide aggregation capabilities have no cloud equivalent. The tool was designed by the Erlangen Regional Computing Center specifically for scientific computing workloads.
If you’re exploring related performance topics, check our guide on Linux CPU Scheduler Analysis for scheduling latency insights. Our Kernel Dynamic Tracing comparison covers perf-probe and dynamic tracepoints. For I/O performance, see our Block I/O Latency Tracing guide.
FAQ
Do these tools work on AMD CPUs?
Yes, with caveats. perf works fully on AMD Zen architectures and exposes AMD-specific PMU events. LIKWID supports AMD processors, including Zen-based EPYC and Ryzen, though some Intel-specific performance groups are unavailable. pmu-tools’ toplev is Intel-only; AMD users can run perf stat --topdown on Zen 4 and newer processors with a recent perf instead.
What kernel permissions are needed?
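Requirements differ by tool. perf and pmu-tools use the perf_event_open() syscall, which is governed by kernel.perf_event_paranoid:
- -1 allows (almost) all events for all users.
- 0 also permits unprivileged system-wide (per-CPU) measurement.
- 1 permits user- and kernel-space profiling of a user’s own processes.
- 2, a common default, restricts unprivileged users to user-space profiling of their own processes.

Users holding CAP_PERFMON (Linux 5.8+) or CAP_SYS_ADMIN are not subject to these limits. LIKWID’s default access modes program model-specific registers (MSRs) through /dev/cpu/*/msr, either directly or via its setuid access daemon, so the msr kernel module must be loaded; this is also how it reads RAPL energy counters. LIKWID can alternatively be built with a perf_event backend.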
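To see which of these levels applies on a given host, a small helper can read the sysctl from /proc/sys/kernel/perf_event_paranoid. The sketch below assumes upstream semantics for levels -1 through 2 and treats anything higher as distribution-specific. The function names are hypothetical.

```python
def read_paranoid(path: str = "/proc/sys/kernel/perf_event_paranoid") -> int:
    """Return the current kernel.perf_event_paranoid value (Linux only)."""
    with open(path) as f:
        return int(f.read().strip())

def unprivileged_scope(level: int) -> str:
    """Describe what a user without CAP_PERFMON/CAP_SYS_ADMIN may measure at this level."""
    if level <= -1:
        return "almost all events, including raw tracepoints"
    if level == 0:
        return "own processes and system-wide CPU events, user and kernel space"
    if level == 1:
        return "own processes, user and kernel space"
    if level == 2:
        return "own processes, user space only"
    # Upstream defines nothing above 2; some distribution kernels patch it in.
    return "not defined upstream; some distribution kernels block unprivileged perf"
```

Usage: `print(unprivileged_scope(read_paranoid()))` before deciding whether a monitoring agent needs extra capabilities.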
Can I monitor containers with these tools?
perf works inside containers with CAP_PERFMON or CAP_SYS_ADMIN. LIKWID requires host-level access since it programs MSRs directly. pmu-tools can monitor containerized workloads from the host by targeting the cgroup or the PID namespace of the container process.
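Inside a container, a quick sanity check is whether the process holds CAP_PERFMON (capability 38, Linux 5.8+) or CAP_SYS_ADMIN (capability 21) in its effective set. Linux reports that set as a hex bitmask on the CapEff: line of /proc/self/status.

The sketch below checks only capabilities. Without either capability, perf_event_paranoid decides what the container may measure, and the runtime's seccomp profile can block perf_event_open() regardless. Function names are illustrative.

```python
CAP_SYS_ADMIN = 21
CAP_PERFMON = 38  # introduced in Linux 5.8

def effective_caps(status_path: str = "/proc/self/status") -> int:
    """Parse the CapEff bitmask (hex) of the current process."""
    with open(status_path) as f:
        for line in f:
            if line.startswith("CapEff:"):
                return int(line.split()[1], 16)
    raise ValueError("CapEff line not found")

def has_perf_capability(caps: int) -> bool:
    """True if either capability that bypasses perf_event_paranoid is present."""
    return bool(caps & (1 << CAP_PERFMON)) or bool(caps & (1 << CAP_SYS_ADMIN))
```

Usage: `has_perf_capability(effective_caps())` returns False in a container started without added capabilities.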
What’s the performance overhead?
Sampling mode (perf record -F 99) typically adds 1-3% overhead. Counting mode (perf stat) is near-zero (<0.1%). LIKWID’s likwid-perfctr keeps overhead close to zero during measurement windows by programming the counters before the workload starts and reading them afterwards. pmu-tools’ toplev usually requests more events than the PMU has counters, so it relies on multiplexing. This may add 2-5% overhead depending on event count, and it makes each reported count an extrapolation.
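A PMU has only a few general-purpose counters per core, often four to eight. When more events are requested, the kernel time-slices them. perf then extrapolates each event as raw × time_enabled / time_running, and the percentage it prints next to a count is time_running / time_enabled.

The sketch below shows that arithmetic with hypothetical helper names. The extrapolation assumes the workload behaves uniformly over the run, so heavily multiplexed counts carry more error.

```python
from typing import Optional

def scale_count(raw: int, time_enabled: int, time_running: int) -> Optional[float]:
    """Extrapolate a multiplexed count to the full enabled time."""
    if time_running == 0:
        return None  # never scheduled on a counter: perf reports <not counted>
    return raw * time_enabled / time_running

def running_percent(time_enabled: int, time_running: int) -> float:
    """Share of the enabled time during which the event was actually counted."""
    return 100.0 * time_running / time_enabled if time_enabled else 0.0
```

An event that was counted for a quarter of the run (25%) has its raw value multiplied by four, which is why low percentages deserve scepticism.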
How do I export metrics to Prometheus or Grafana?
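For toplev, convert its CSV output (toplev.py -x,) to Prometheus text format and write it as .prom files into the directory read by node_exporter’s textfile collector. For perf, use perf stat -x, (CSV) or, on newer perf releases, perf stat --json-output, then convert the result to Prometheus format with a short Python script. LIKWID’s likwid-perfctr -O prints CSV instead of formatted tables, which Telegraf can ingest with its CSV data format parser (for example through the exec or file input).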
Can I use these in CI/CD pipelines?
Yes. Run perf stat or toplev.py before and after code changes to detect performance regressions. LIKWID is ideal for CI benchmarking due to its reproducible, low-variance measurements. All three tools can be scripted and their outputs parsed for automated threshold alerts.
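A regression gate can compare the same metrics from a baseline build and a candidate build. The sketch below takes the median of repeated runs to damp noise. It flags metrics that moved in the wrong direction by more than a relative tolerance.

Metrics listed in higher_is_better (IPC here) count as regressions when they drop; all others count as regressions when they grow. The names, the 5% tolerance and the metric keys are hypothetical.

```python
from statistics import median

def find_regressions(baseline: dict[str, list[float]],
                     candidate: dict[str, list[float]],
                     tolerance: float = 0.05,
                     higher_is_better: frozenset = frozenset({"ipc"})) -> list[str]:
    """Return one message per metric whose median got worse by more than `tolerance`."""
    flagged = []
    for metric, base_runs in sorted(baseline.items()):
        cand_runs = candidate.get(metric)
        if not base_runs or not cand_runs:
            continue
        base, cand = median(base_runs), median(cand_runs)
        if base == 0:
            continue  # relative change is undefined
        change = (cand - base) / base
        # Cost metrics (cycles, cache misses) regress upward; IPC regresses downward.
        worse_by = -change if metric in higher_is_better else change
        if worse_by > tolerance:
            flagged.append(f"{metric}: {base:g} -> {cand:g} ({change:+.1%})")
    return flagged
```

A CI job can fail the build whenever the returned list is non-empty and print the messages as the failure reason.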