Introduction
Performance regressions are among the most painful bugs to diagnose. A function that was fast yesterday becomes slow today, and nobody knows why. Without systematic benchmarking integrated into your development workflow, performance degradation creeps in silently — one pull request at a time — until your application feels sluggish and your users notice before you do.
Python’s benchmarking ecosystem provides several tools for measuring, tracking, and preventing performance regressions. This guide compares four leading solutions: pytest-benchmark (the pytest-integrated standard), CodSpeed (CI-native performance tracking), pyperf (the PSF’s statistical benchmark toolkit), and airspeed-velocity (for tracking performance across git commits).
Comparison Table
| Feature | pytest-benchmark | CodSpeed | pyperf | airspeed-velocity |
|---|---|---|---|---|
| Type | pytest plugin | CI platform + pytest plugin | Standalone toolkit | CLI that benchmarks across git commits |
| Statistical Rigor | Basic (min/max/mean/median, IQR, stddev) | Advanced (with CI integration) | Advanced (outlier detection, warmup) | Basic (timing comparison) |
| CI Integration | Native (pytest) | Native (GitHub Actions) | Manual (scripts) | Manual (git bisect) |
| Historical Tracking | Limited (saved runs, --benchmark-compare) | Yes (dashboard) | Limited (compare_to on saved JSON files) | Yes (across commits) |
| GitHub Stars | ~1,500 | ~300 | ~1,000 | ~900 |
| JSON Output | Yes | Yes | Yes | Yes |
| Calibration | Yes (iterations per round) | N/A (CPU simulation) | Yes (loop count and warmups) | No |
| Web UI | No | Yes (dashboard) | No | Limited (ASV web) |
| Best For | Unit-benchmarking in CI | Performance regression prevention | Scientific benchmarking | Git-based regression hunting |
pytest-benchmark: The Pytest-Native Standard
pytest-benchmark is one of the most widely adopted Python benchmarking tools. It integrates directly into pytest through a benchmark fixture, allowing you to write benchmarks alongside your tests.
Installation:
pip install pytest-benchmark
Basic Usage:
Running benchmarks:
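pytest --benchmark-only        # run only the benchmark tests
pytest --benchmark-autosave    # save this run's results for later comparison
pytest --benchmark-compare     # compare against the most recent saved run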
pytest-benchmark prints a table with min, max, mean, standard deviation, median, and interquartile range for each benchmark. It also reports how many outliers each benchmark produced, making it easy to spot noisy measurements.
CodSpeed: CI-Native Performance Tracking
CodSpeed takes a different approach: rather than a one-shot benchmarking tool, it’s a continuous performance tracking platform that integrates with GitHub Actions. Every PR automatically runs benchmarks against the main branch, and CodSpeed reports whether the PR introduced performance regressions.
Installation:
pip install pytest-codspeed
Setup (GitHub Actions):
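Add the CodSpeed GitHub Action as a workflow step and have it run pytest with the --codspeed flag, typically supplying the repository's CodSpeed token as a secret.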
Writing CodSpeed benchmarks:
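pytest-codspeed accepts the same benchmark fixture that pytest-benchmark uses, and it also provides a @pytest.mark.benchmark marker that measures an entire test function.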
CodSpeed’s key advantage is noise reduction on shared CI infrastructure. According to its documentation, its instrumented mode runs benchmarks under CPU simulation rather than trusting wall-clock time on the runner, which makes results far more repeatable from run to run. It also provides a web dashboard showing performance trends over time, PR-level regression detection, and per-function performance profiles.
pyperf: The PSF’s Statistical Toolkit
pyperf is a benchmarking toolkit hosted under the Python Software Foundation’s GitHub organization and designed for statistical rigor. It calibrates the number of loops per measurement, runs each benchmark in several worker processes, discards warmup values, and reports outliers, making it ideal for precise benchmarking where measurement noise could mask real differences.
Installation:
pip install pyperf
Writing Benchmarks:
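A pyperf benchmark is a script that creates a pyperf.Runner() and registers work with runner.bench_func(name, func, *args) or runner.timeit(name, stmt, setup).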
Running:
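python bench.py -o result.json                          # bench.py is a placeholder script name
python -m pyperf stats result.json                      # detailed statistics for one run
python -m pyperf compare_to baseline.json result.json   # compare two saved runs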
pyperf reports the mean and standard deviation of each benchmark and warns when a result looks unstable, and its compare_to command tests whether a difference between two runs is statistically significant. On a quiet, tuned machine this makes differences as small as 1-2% detectable with reasonable confidence. For scientific benchmarking where correctness matters more than convenience, pyperf is the gold standard.
airspeed-velocity: Git-Based Performance Tracking
airspeed-velocity (ASV) takes a unique approach: it benchmarks your code across git commits, making it ideal for finding exactly which commit introduced a performance regression.
Installation:
pip install asv
Setup:
asv quickstart    # generates a starter asv.conf.json and a benchmarks/ directory
Configuration (asv.conf.json):
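asv.conf.json is a JSON file whose main keys include project, repo (path or URL of the repository), branches, environment_type (for example virtualenv or conda), pythons (the interpreter versions to test), and benchmark_dir.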
Writing ASV Benchmarks:
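A minimal sketch of an ASV benchmark file. ASV discovers plain Python classes and functions by naming convention: methods prefixed with time_ are timed, and setup runs before each one without being counted. The class name, file name, and sizes below are illustrative assumptions.

```python
# benchmarks/bench_sort.py (hypothetical file name)


class SortSuite:
    # ASV runs every time_* method once per parameter value.
    params = [100, 10_000]
    param_names = ["n"]

    def setup(self, n):
        # Runs before each benchmark; excluded from the measured time.
        self.data = list(range(n, 0, -1))

    def time_sorted(self, n):
        # Only the body of this method is timed.
        sorted(self.data)
```

No import from asv is needed in the benchmark file itself, which keeps the suite importable as ordinary Python.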
Running:
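asv run                     # run the benchmark suite
asv continuous main HEAD    # compare two revisions and flag regressions
asv publish                 # build the static HTML report
asv preview                 # serve the report locally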
ASV’s web interface shows performance timelines, making it straightforward to visualize when regressions were introduced and by which commits. This makes it invaluable for post-mortem analysis of performance bugs.
CI Integration Pattern
For comprehensive performance regression detection, combine multiple tools:
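One way to wire benchmark results into CI is a small gate script that compares a stored baseline with the current run and fails the job when anything slows down beyond a threshold. The sketch below is an assumption about how such a gate could look, not the original pipeline: the function names are hypothetical, and load_medians assumes the JSON layout written by pytest-benchmark's --benchmark-json option (a top-level "benchmarks" list whose entries carry "name" and "stats").

```python
import json


def load_medians(path):
    """Map benchmark name -> median seconds from a pytest-benchmark JSON file.

    Assumes the layout produced by `pytest --benchmark-json=PATH`.
    """
    with open(path) as fh:
        report = json.load(fh)
    return {b["name"]: b["stats"]["median"] for b in report["benchmarks"]}


def find_regressions(baseline, current, threshold=0.10):
    """Return {name: relative slowdown} for benchmarks slower than threshold.

    `baseline` and `current` map benchmark names to median seconds.
    Benchmarks missing from `current` are skipped rather than failed.
    """
    regressions = {}
    for name, base_time in baseline.items():
        new_time = current.get(name)
        if new_time is None or base_time <= 0:
            continue
        slowdown = (new_time - base_time) / base_time
        if slowdown > threshold:
            regressions[name] = slowdown
    return regressions
```

A CI step could call find_regressions on the two loaded files and exit with a non-zero status when the returned dictionary is not empty.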
Why Self-Host Your Benchmarking Pipeline?
Performance benchmarking should be part of your CI pipeline, not an afterthought. Self-hosted benchmarking tools give you complete control over the measurement environment — consistent hardware, isolated processes, and no shared-CI noise. Unlike SaaS performance monitoring platforms that charge per benchmark-minute, self-hosted tools run on your infrastructure at zero marginal cost.
Benchmarking complements other quality practices covered in our guides. For static analysis, see our Python type checkers guide. For runtime safety, our rate limiting libraries comparison covers protecting your APIs. Our Python profiling tools guide covers tools for finding performance hotspots that you should then benchmark.
FAQ
How do I get reliable benchmarks on shared CI runners?
Shared CI runners (like GitHub Actions free tier) are noisy: CPU throttling, co-tenancy, and varying load all affect results. Mitigation strategies: (1) use CodSpeed’s instrumented measurements, which do not depend on wall-clock time, (2) run benchmarks multiple times and compare medians rather than means, (3) on machines you control, apply pyperf’s system tuning (python -m pyperf system tune), (4) for critical benchmarks, use self-hosted runners on dedicated hardware, (5) set a regression threshold of at least 5-10% to avoid false positives from noise.
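To see why the median is the safer summary on a noisy runner, consider a handful of timings where one sample was hit by a load spike. The numbers below are made up for illustration.

```python
import statistics

# Five hypothetical timings in seconds; the last one caught a noisy neighbour.
timings = [0.0101, 0.0100, 0.0102, 0.0099, 0.0350]

mean = statistics.mean(timings)      # pulled upward by the single spike
median = statistics.median(timings)  # stays at a typical sample

print(f"mean={mean:.4f}s median={median:.4f}s")
```

A single outlier inflates the mean by roughly half here, which on its own would trip a 10% regression threshold, while the median remains at a representative value.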
Should I benchmark in unit tests or separate benchmark files?
Start with unit-level benchmarks alongside your tests (pytest-benchmark in test files). They catch obvious regressions with minimal overhead. As your project matures, add dedicated benchmark suites (separate benchmarks/ directory) for more thorough, longer-running benchmarks that you run less frequently or on a schedule.
What makes a good benchmark?
A good benchmark: (1) runs fast enough to complete in CI (under 1 second total), (2) measures a single, well-defined operation, (3) uses realistic input sizes, (4) avoids I/O (disk, network) which introduces noise, (5) includes setup/teardown that isn’t counted in measurement time. Bad benchmarks measure wall-clock time of operations that involve network calls, database queries, or filesystem access — these are integration tests, not benchmarks.
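Keeping setup out of the measured region can be sketched with the standard library's timeit module. This is an illustrative helper (time_sort is a hypothetical name), not part of any of the tools above.

```python
import timeit


def time_sort(n, repeat=5, number=20):
    """Best per-call time in seconds for sorting n reversed integers."""
    # Setup: build the input once, outside the timed statement.
    data = list(range(n, 0, -1))
    # Only the lambda body is timed; sorted() returns a new list,
    # so every call sees the same unsorted input.
    timer = timeit.Timer(lambda: sorted(data))
    # The minimum over several repeats is least affected by background noise.
    return min(timer.repeat(repeat=repeat, number=number)) / number
```

pytest-benchmark offers a comparable hook through benchmark.pedantic, which accepts a setup callable that runs before each round and is not timed.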
How do I compare benchmarks across different Python versions?
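Run the same benchmark file under each interpreter on the same machine, so that the Python version is the only variable. With pyperf, save one JSON result per interpreter and compare them with python -m pyperf compare_to. For ASV, list the versions under pythons in asv.conf.json so that it builds a separate environment for each and runs the benchmarks in all of them. Always record the Python version and system information alongside benchmark results for fair comparison.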
Can benchmarking catch algorithmic regressions?
Yes, if your benchmarks use realistic input sizes. An O(n²) function replacing an O(n log n) one won’t show much difference at n=100, but benchmarks at n=10000 will reveal the regression. Parametrize your benchmarks with multiple input sizes to catch algorithmic complexity regressions.
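Measuring at several input sizes can be sketched with the standard library alone: time the function at each size and fit the slope of log(time) against log(n). The helper names and the two duplicate-detection functions below are illustrative assumptions, and the estimate is rough because it is based on wall-clock timings.

```python
import math
import timeit


def best_time(func, n, repeat=5):
    """Smallest of several single-call timings for func on n distinct items."""
    data = list(range(n))
    fastest = min(timeit.repeat(lambda: func(data), repeat=repeat, number=1))
    return max(fastest, 1e-9)  # guard against log(0) on very coarse clocks


def growth_exponent(func, sizes):
    """Least-squares slope of log(time) versus log(n); about 1 means linear."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(best_time(func, n)) for n in sizes]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


def has_duplicates_quadratic(items):
    # Compares every pair: O(n^2) when there are no duplicates.
    return any(
        items[i] == items[j]
        for i in range(len(items))
        for j in range(i + 1, len(items))
    )


def has_duplicates_set(items):
    # Single pass through a hash set: O(n) on average.
    return len(set(items)) != len(items)
```

Calling growth_exponent on both functions with sizes such as [200, 400, 800, 1600] should give a noticeably larger slope for the pairwise version, which is the signal a parametrized benchmark suite surfaces when an algorithmic regression slips in.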