Introduction
The Linux CPU scheduler is responsible for deciding which task runs on which CPU at any given moment. With the Completely Fair Scheduler (CFS) and the newer EEVDF scheduler in kernel 6.6+, understanding scheduler behavior is essential for diagnosing latency spikes, CPU contention, and uneven load distribution across cores. Linux provides several built-in tools for scheduler analysis, each offering a different level of detail and operational overhead.
This guide compares three primary Linux scheduler analysis interfaces: schedstat (runtime scheduling statistics), sched_debug (detailed per-CPU scheduler state), and perf sched (performance-event-based scheduler tracing). Each tool addresses different analysis needs — from quick health checks to deep latency investigations.
Comparison Table
| Feature | schedstat | sched_debug | perf sched |
|---|---|---|---|
| Data Source | /proc/schedstat | /sys/kernel/debug/sched/debug | perf event subsystem |
| Granularity | Per-CPU aggregate counters | Per-CPU, per-runqueue, per-task | Per-task, per-event |
| Overhead | Negligible (<0.1% CPU) | Low (read-only snapshot) | Medium-High (event tracing) |
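| Enabled By Default | Usually (needs CONFIG_SCHEDSTATS; some counters also need the kernel.sched_schedstats sysctl) | Yes (requires debugfs mount) | Kernel support usually yes (perf tool installed separately) |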
| Historical Analysis | Manual collection | Snapshot only | Yes (trace recording) |
| Latency Analysis | Aggregate wait/run times | Per-task scheduling details | Microsecond-level event timeline |
| Best For | Health monitoring, dashboards | Detailed state inspection | Latency debugging, regression testing |
| Output Format | Key-value counters | Human-readable dump | Binary trace (perf.data) |
schedstat: Runtime Scheduling Counters
The /proc/schedstat file provides aggregate per-CPU scheduling statistics. It is the lightest-weight option, suitable for continuous monitoring in production environments.
Key metrics from schedstat:
- sched_count: Number of times schedule() was called on this CPU (not every call results in a context switch)
- sched_goidle: Times schedule() found no runnable task and left the CPU idle
- ttwu_count: Number of times try_to_wake_up() was called on this CPU
- ttwu_local: try_to_wake_up() calls where the waking and target CPU are the same
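The counters above can be read programmatically. The following is a minimal sketch, assuming the `cpu<N>` line layout that the kernel's sched-stats documentation describes for schedstat version 15 (nine numeric fields per CPU line); the function name `parse_schedstat` and the dictionary keys `legacy_unused`, `run_ns`, `wait_ns`, and `timeslices` are labels chosen for this example, not kernel identifiers.

```python
from typing import Dict

# Field order on "cpu<N>" lines, assuming the version 15 layout.
CPU_FIELDS = (
    "yld_count",      # sched_yield() calls
    "legacy_unused",  # always zero, kept for ABI compatibility
    "sched_count",    # schedule() calls
    "sched_goidle",   # schedule() calls that left the CPU idle
    "ttwu_count",     # try_to_wake_up() calls
    "ttwu_local",     # try_to_wake_up() calls targeting the local CPU
    "run_ns",         # time tasks spent running (nanoseconds)
    "wait_ns",        # time tasks spent waiting to run (nanoseconds)
    "timeslices",     # timeslices run on this CPU
)


def parse_schedstat(text: str) -> Dict[str, Dict[str, int]]:
    """Return {"cpu0": {"sched_count": ..., ...}, ...} from schedstat text."""
    stats: Dict[str, Dict[str, int]] = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip the version, timestamp, and domain lines.
        if not parts or not parts[0].startswith("cpu"):
            continue
        # Skip lines that are shorter than the assumed layout.
        if len(parts) - 1 < len(CPU_FIELDS):
            continue
        values = [int(v) for v in parts[1:1 + len(CPU_FIELDS)]]
        stats[parts[0]] = dict(zip(CPU_FIELDS, values))
    return stats
```

On a live system, passing the contents of /proc/schedstat to `parse_schedstat` yields one dictionary per CPU. It is worth checking the `version` line first, since the layout may differ between schedstat versions.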
schedstat Monitoring with Docker Compose:
sched_debug: Detailed Scheduler State
The sched_debug interface provides an exhaustive snapshot of the scheduler’s internal state — per-CPU runqueues, per-task scheduling statistics, and load balancing details. This is the tool to reach for when investigating why a specific task is not getting enough CPU time.
Key sections in sched_debug output:
- Per-CPU runqueue information
- Per-task scheduling statistics
- Load balancing domain statistics
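Runqueue depth is often the first thing to pull out of a dump. The sketch below assumes the dump contains `cpu#N` header lines followed by `.nr_running : value` lines; that layout is not a stable interface and may differ between kernel versions, so treat the regular expressions and the helper name `runqueue_depths` as illustrative assumptions.

```python
import re
from typing import Dict

# Assumed line shapes; the sched_debug layout is not a stable ABI.
CPU_HEADER = re.compile(r"^cpu#(\d+)")
NR_RUNNING = re.compile(r"^\s*\.nr_running\s*:\s*(\d+)")


def runqueue_depths(dump: str) -> Dict[int, int]:
    """Map CPU number to the first .nr_running value after its header."""
    depths: Dict[int, int] = {}
    current = None
    for line in dump.splitlines():
        header = CPU_HEADER.match(line)
        if header:
            current = int(header.group(1))
            continue
        match = NR_RUNNING.match(line)
        # Keep only the first match per CPU; later ones may belong to
        # nested per-class runqueue sections.
        if match and current is not None and current not in depths:
            depths[current] = int(match.group(1))
    return depths
```

A CPU whose depth stays well above 1 across several snapshots is a reasonable place to start looking for contention.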
Automated sched_debug Collection
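One way to automate collection is to copy the snapshot to timestamped files at a fixed interval. The following is a minimal sketch: `collect_snapshots`, its parameters, and the output file naming are hypothetical, and reading the default source path assumes root access and a mounted debugfs.

```python
import time
from pathlib import Path

# Default location used in this guide; older kernels may expose a
# different path.
DEFAULT_SOURCE = Path("/sys/kernel/debug/sched/debug")


def collect_snapshots(out_dir, count=3, interval_s=10.0, source=DEFAULT_SOURCE):
    """Copy `source` into `out_dir` `count` times, `interval_s` apart."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for i in range(count):
        stamp = time.strftime("%Y%m%dT%H%M%S")
        # The index keeps names unique even within the same second.
        target = out / f"sched_debug-{stamp}-{i:03d}.txt"
        target.write_text(Path(source).read_text())
        written.append(target)
        if i + 1 < count:
            time.sleep(interval_s)
    return written
```

Because each file is a point-in-time snapshot, comparing consecutive files is what reveals trends such as a steadily growing runqueue.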
perf sched: Event-Based Scheduler Tracing
perf sched provides the deepest scheduler analysis by recording tracepoint events and reconstructing the scheduling timeline. It can identify sub-millisecond latency issues that aggregate counters cannot capture.
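perf sched latency summarizes, per task, the runtime, the number of switches, and the average and maximum scheduling delay. perf sched timehist prints a per-event timeline that helps visualize wakeup chains.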
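The record-then-analyze flow can be scripted. This is a minimal sketch, assuming the perf tool is installed and the caller has the privileges it needs; the helper names `perf_sched_commands` and `run_investigation` are hypothetical, and the 30-second default window is only an example.

```python
import subprocess
from typing import List


def perf_sched_commands(duration_s: int = 30) -> List[List[str]]:
    """Build the command sequence: record, then two analysis passes."""
    return [
        # Record scheduler events while "sleep" runs for the given window.
        ["perf", "sched", "record", "--", "sleep", str(duration_s)],
        # Per-task delay summary, sorted by maximum delay.
        ["perf", "sched", "latency", "--sort", "max"],
        # Per-event timeline read from the same perf.data file.
        ["perf", "sched", "timehist"],
    ]


def run_investigation(duration_s: int = 30, workdir: str = ".") -> None:
    """Run each command in `workdir`, where perf.data is written and read."""
    for cmd in perf_sched_commands(duration_s):
        subprocess.run(cmd, cwd=workdir, check=True)
```

Keeping the command list separate from the runner makes it easy to print or log the exact commands before executing them on a production host.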
Tool Selection Guide
| Use Case | Recommended Tool | Why |
|---|---|---|
| Production monitoring | schedstat | Negligible overhead, easy to scrape |
| Troubleshooting CPU contention | sched_debug | Per-task state and runqueue depth |
| Latency investigation | perf sched | Microsecond event timeline |
| Load balancing issues | sched_debug | Domain statistics and failure counts |
| Wakeup latency regression | perf sched latency | Wakeup-to-schedule gap analysis |
| Historical trend analysis | schedstat + Prometheus | Persistent counter storage |
Why Self-Host Your Scheduler Monitoring?
Scheduler analysis is deeply system-specific — cloud monitoring services cannot access kernel debugging interfaces on your bare metal or virtual machines. Running schedstat collection locally and shipping metrics to a self-hosted Prometheus instance gives you continuous visibility into scheduler health. Combined with our performance profiling guide, scheduler metrics complete the picture of CPU resource utilization. For alternative scheduler deployments, our BPF scheduler guide covers sched-ext and custom scheduling policies. If you are managing containerized workloads, our cgroup v2 guide covers CPU bandwidth control through cgroups.
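Shipping counters to Prometheus comes down to rendering them in its text exposition format. The sketch below assumes the counters are already available as a mapping of CPU name to counter values; the function `to_prometheus`, the `schedstat` metric prefix, and the `cpu` label are naming choices made for this example.

```python
from typing import Dict


def to_prometheus(stats: Dict[str, Dict[str, int]], prefix: str = "schedstat") -> str:
    """Render {"cpu0": {"sched_count": 5}} as Prometheus text format."""
    lines = []
    # Collect every counter name that appears on any CPU.
    names = sorted({name for counters in stats.values() for name in counters})
    for name in names:
        metric = f"{prefix}_{name}_total"
        lines.append(f"# TYPE {metric} counter")
        for cpu in sorted(stats):
            if name in stats[cpu]:
                lines.append(f'{metric}{{cpu="{cpu}"}} {stats[cpu][name]}')
    return "\n".join(lines) + "\n"
```

One possible deployment is to write this output to a file that node_exporter's textfile collector picks up, which avoids running a separate HTTP endpoint.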
Choosing the Right Tool for Your Investigation
Selecting between schedstat, sched_debug, and perf sched depends on the type of problem you are trying to solve. For continuous monitoring, schedstat counters scraped into Prometheus provide always-on visibility without any measurable overhead. If you notice CPU latency spikes in application metrics, sched_debug gives you an immediate snapshot of runqueue depth and per-task scheduling statistics that can reveal CPU hogs. For deep-dive regression investigations — such as a kernel upgrade causing 5ms latency increases in a real-time application — perf sched recording with timehist analysis provides the microsecond-level event timeline needed to identify the root cause.
These tools also complement each other. A typical troubleshooting workflow starts with schedstat to confirm that scheduling activity (schedule() call rates, wakeup rates, or run-queue wait time) has increased, then moves to sched_debug to identify which tasks and CPUs are affected, and finally uses perf sched recording to capture the exact sequence of scheduling events that caused the issue. Keeping all three tools in your system administration toolkit lets you respond to scheduler issues at any level of detail.
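The first step of that workflow, confirming that rates have changed, requires two snapshots because the counters are cumulative. A minimal sketch of the arithmetic follows; `counter_rates` is a hypothetical helper that takes two counter dictionaries for the same CPU and the time between them.

```python
from typing import Dict


def counter_rates(before: Dict[str, int], after: Dict[str, int],
                  elapsed_s: float) -> Dict[str, float]:
    """Per-second rate of each counter present in both snapshots."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return {
        name: (after[name] - before[name]) / elapsed_s
        for name in before
        if name in after
    }
```

For example, a sched_count that moves from 1000 to 1600 over 2 seconds corresponds to 300 schedule() calls per second; comparing that figure against a baseline taken during normal operation shows whether activity has actually increased.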
FAQ
Do I need to enable CONFIG_SCHEDSTATS in my kernel?
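On most modern distributions (kernel 5.x+), CONFIG_SCHEDSTATS is enabled in the default kernel build. Check with zgrep CONFIG_SCHEDSTATS /proc/config.gz or, where /proc/config.gz is not provided, grep CONFIG_SCHEDSTATS /boot/config-$(uname -r). If the option is missing, /proc/schedstat will be absent or will lack per-CPU counters. Even with the option built in, recent kernels keep most counters switched off at runtime until you set the kernel.sched_schedstats sysctl to 1 (or boot with schedstats=enable). If you do need to rebuild the kernel, enable the option in make menuconfig and follow your distribution's kernel packaging procedure.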
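The same check can be done from a script. This is a minimal sketch, assuming the kernel config is available either as /proc/config.gz or as /boot/config-<release> (a common distribution convention, not a guarantee); the helper names `option_enabled` and `read_kernel_config` are hypothetical.

```python
import gzip
import platform
from pathlib import Path
from typing import Optional


def option_enabled(config_text: str, option: str = "CONFIG_SCHEDSTATS") -> bool:
    """True if the config text contains an exact '<option>=y' line."""
    return any(line.strip() == f"{option}=y" for line in config_text.splitlines())


def read_kernel_config() -> Optional[str]:
    """Return the running kernel's config text, or None if not found."""
    proc = Path("/proc/config.gz")
    if proc.exists():
        with gzip.open(proc, "rt") as handle:
            return handle.read()
    # Fallback location used by several distributions (an assumption).
    boot = Path(f"/boot/config-{platform.release()}")
    if boot.exists():
        return boot.read_text()
    return None
```

A result of None from `read_kernel_config` means neither location exists, in which case the distribution's kernel package documentation is the next place to look.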
Why does sched_debug show negative nr_uninterruptible?
The nr_uninterruptible counter in sched_debug tracks tasks in uninterruptible sleep (D state), but it is maintained per runqueue. A task that enters D state on one CPU increments that CPU's counter; if it is later woken on a different CPU, the decrement lands on the other CPU's counter. Individual per-CPU values can therefore be negative without indicating a bug, and only the sum across all CPUs is meaningful. This does not affect scheduler correctness.
How much overhead does perf sched recording add?
perf sched record adds approximately 1-5% CPU overhead on modern systems, depending on the event rate. On systems with very high context switch rates (>100K/s), overhead can reach 10-15%. Use short recording windows (30-60 seconds) and avoid recording in production during peak hours. perf sched latency (the analysis phase) adds no tracing overhead because it post-processes an existing trace file, although the analysis itself still uses CPU while it runs.
Can I use these tools inside containers?
schedstat (/proc/schedstat) and sched_debug (/sys/kernel/debug/sched/debug) require access to the host’s procfs and debugfs. With Docker, run the container with --pid=host and bind-mount the paths read-only, for example -v /proc:/host/proc:ro and -v /sys/kernel/debug:/sys/kernel/debug:ro. debugfs is normally readable only by root, so the process inside the container must run as root. perf sched requires CAP_SYS_ADMIN and access to the host’s perf_event subsystem.
What does a high sched_goidle count indicate?
sched_goidle counts the number of times the scheduler found no runnable task and the CPU went idle. A high count relative to sched_count (schedule() calls) indicates the system has spare CPU capacity. A low ratio, where sched_goidle is near zero, means the CPU is saturated and continuously has work to schedule. For latency-sensitive workloads, this saturation point is where queuing delays begin to accumulate.
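The ratio described here is simple to compute. The sketch below uses a hypothetical helper, `idle_ratio`, and assumes the two inputs are deltas taken over the same interval rather than totals since boot, so that the result reflects current load.

```python
from typing import Optional


def idle_ratio(sched_goidle: int, sched_count: int) -> Optional[float]:
    """Fraction of schedule() calls that left the CPU idle.

    Returns None when there were no schedule() calls in the interval,
    since the ratio is undefined in that case.
    """
    if sched_count <= 0:
        return None
    return sched_goidle / sched_count
```

With 400 idle transitions out of 1000 schedule() calls the ratio is 0.4; values approaching 0 over a sustained period point toward the saturation described above.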