System monitoring at the host level requires more than CPU and memory metrics — you need visibility into individual processes. Which applications are consuming the most resources? Are critical services running? Have any processes entered problematic states? Process monitoring exporters answer these questions by exposing per-process metrics to your observability stack.
This guide compares three approaches to self-hosted process monitoring: the dedicated process-exporter for Prometheus, the Kubernetes Node Problem Detector for cluster-wide node health, and Telegraf with its procstat input plugin for unified metric collection. Each targets a different scale and use case, from single-host monitoring to enterprise Kubernetes clusters.
Why Self-Host Process Monitoring?
Process-level visibility is essential for production infrastructure. When a Java application starts consuming 90% of CPU, you need to know immediately — not after users report slowdowns. Process monitoring provides the granularity that aggregate host metrics cannot.
Self-hosting process monitoring gives you full control over metric retention, alerting rules, and data privacy. Unlike SaaS monitoring platforms, your process data never leaves your infrastructure. This is critical for regulated industries where process names, arguments, and resource usage patterns could reveal sensitive operational details.
Cost is another factor. Commercial APM platforms charge per host and per metric. Self-hosted process exporters feed into your existing Prometheus or Grafana stack at zero marginal cost, regardless of how many processes you monitor.
For GPU-specific monitoring, see our GPU monitoring comparison. For broader metric collection, our metrics collectors guide covers Telegraf, statsd, and Vector. Database-focused monitoring is covered in our PostgreSQL monitoring guide.
process-exporter: Dedicated Prometheus Process Metrics
process-exporter (2,117+ stars) is a Prometheus exporter that reads the /proc filesystem and exposes detailed per-process metrics. It groups processes by configurable name patterns and reports CPU time, memory usage, open file descriptors, and thread counts for each group.
Features
- Process grouping: group processes by name, command-line regex, or parent-child relationships
- Detailed metrics: CPU time, resident set size (RSS), virtual memory, open file descriptors, thread count
- Configurable filtering: include/exclude processes by name, user, or command pattern
- Prometheus native: outputs standard Prometheus metrics, scrapable by any Prometheus server
- Low overhead: reads /proc directly, with minimal CPU and memory impact
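To make the /proc-based collection concrete, here is a minimal Python sketch that parses one line of /proc/<pid>/stat into a few of the values listed above. It illustrates the data source only and is not process-exporter's own code. The names `ProcSample` and `parse_stat_line` are hypothetical, and the default clock-tick rate (100) and page size (4096) are assumptions; on a real host, read them from `os.sysconf`.

```python
from typing import NamedTuple


class ProcSample(NamedTuple):
    pid: int
    comm: str
    cpu_seconds: float
    rss_bytes: int
    num_threads: int


def parse_stat_line(line: str, clk_tck: int = 100, page_size: int = 4096) -> ProcSample:
    """Parse one line of /proc/<pid>/stat (field numbering as in proc(5))."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the LAST ')'.
    lparen = line.index("(")
    rparen = line.rindex(")")
    pid = int(line[:lparen])
    comm = line[lparen + 1:rparen]
    rest = line[rparen + 2:].split()  # rest[0] is field 3 (state)

    utime = int(rest[11])        # field 14, user-mode clock ticks
    stime = int(rest[12])        # field 15, kernel-mode clock ticks
    num_threads = int(rest[17])  # field 20
    rss_pages = int(rest[21])    # field 24, resident pages

    return ProcSample(
        pid=pid,
        comm=comm,
        cpu_seconds=(utime + stime) / clk_tck,
        rss_bytes=rss_pages * page_size,
        num_threads=num_threads,
    )
```

On a Linux host you could feed it the contents of /proc/self/stat together with `os.sysconf("SC_CLK_TCK")` and `os.sysconf("SC_PAGE_SIZE")`.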
Docker Deployment
| |
Configuration file (config.yml):
| |
Prometheus Scraping
Add to your prometheus.yml:
| |
Key metrics exposed:
- namedprocess_namegroup_cpu_seconds_total: CPU time per process group
- namedprocess_namegroup_memory_bytes: memory usage (resident, virtual, proportional)
- namedprocess_namegroup_open_filedesc: open file descriptors
- namedprocess_namegroup_num_procs: number of processes in group
- namedprocess_namegroup_num_threads: thread count
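Because the CPU metric is a cumulative counter, it only becomes useful once you turn two scrapes into a rate. The sketch below shows that arithmetic in Python; the function name `cpu_cores_used` is hypothetical, and the reset handling (treating a decrease as a restart from zero) is a simplified imitation of how Prometheus-style rate calculations treat counters, not an exact reimplementation.

```python
def cpu_cores_used(prev_seconds: float, curr_seconds: float, interval_seconds: float) -> float:
    """Average number of CPU cores used between two scrapes of a cumulative counter."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    delta = curr_seconds - prev_seconds
    if delta < 0:
        # The counter went backwards, so assume it restarted from zero
        # and count only what has accumulated since the restart.
        delta = curr_seconds
    return delta / interval_seconds
```

For example, a group whose counter moves from 120 to 135 CPU-seconds across a 30-second scrape interval averaged half a core over that window.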
Grafana Dashboard
Import community dashboard ID 249 for a pre-built process monitoring view, or create custom panels:
| |
Node Problem Detector: Kubernetes Node Health
Kubernetes Node Problem Detector (3,408+ stars) is a daemon that runs on each Kubernetes node and detects conditions that could affect pod scheduling or node health. It monitors for hardware issues, kernel problems, and container runtime errors.
Features
- Node condition reporting: sets custom node conditions through the Kubernetes API (for example KernelDeadlock, ReadonlyFilesystem)
- Hardware monitoring: detects kernel panics, OOM kills, disk errors, and network issues
- Custom monitors: supports custom monitoring scripts via a JSON-based plugin system
- Event generation: creates Kubernetes events for detected problems
- Cloud provider integration: works with GKE, EKS, and AKS node health reporting
- Process monitoring: watches for critical process failures (kubelet, container runtime)
Kubernetes Deployment
Deploy as a DaemonSet across all nodes:
| |
Custom Problem Monitors
Create custom monitors by adding JSON configurations to /etc/node-problem-detector/config.d/:
| |
Custom monitoring script (check_process.sh):
| |
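As far as I understand the custom plugin convention, Node Problem Detector judges a check by the script's exit status (0 for OK, 1 for a detected problem, 2 for unknown) and takes a short line of standard output as the message. The sketch below shows that logic in Python rather than shell, purely as an illustration. The exit-status constants, the function names, and the required process names are assumptions, not taken from the original script.

```python
import os
from typing import Iterable, Set, Tuple

OK, NON_OK, UNKNOWN = 0, 1, 2  # assumed plugin exit-status convention


def running_process_names(proc_root: str = "/proc") -> Set[str]:
    """Collect the comm name of every process visible under proc_root."""
    names = set()
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a PID directory
        try:
            with open(os.path.join(proc_root, entry, "comm")) as f:
                names.add(f.read().strip())
        except OSError:
            continue  # process exited between listdir() and open()
    return names


def check_processes(running: Iterable[str], required: Iterable[str]) -> Tuple[int, str]:
    """Return (exit_status, message) for a set of required process names."""
    running = set(running)
    missing = sorted(name for name in required if name not in running)
    if missing:
        return NON_OK, "missing: " + ", ".join(missing)
    return OK, "all required processes running"


def main() -> int:
    """Print the message and return the status; hypothetical required names."""
    try:
        status, message = check_processes(running_process_names(), ["kubelet", "containerd"])
    except OSError as exc:
        status, message = UNKNOWN, f"cannot read /proc: {exc}"
    print(message)
    return status
```

A script built this way would end with `sys.exit(main())` so the exit status reaches the monitor. Note that the kernel truncates comm names to 15 characters, so longer process names need matching against the command line instead.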
Node Conditions
Node Problem Detector's bundled monitors report node conditions such as:
- KernelDeadlock: kernel is not responding
- ReadonlyFilesystem: root filesystem mounted read-only
- CorruptDockerOverlay2: Docker overlay filesystem corruption
MemoryPressure (node memory critically low) and DiskPressure (node disk space critically low) are reported by the kubelet rather than by Node Problem Detector.
Node Problem Detector only reports these conditions. They do not block pod scheduling on their own; that requires a separate remedy controller or automation that taints or cordons the affected node.
Telegraf procstat: Unified Metric Collection
Telegraf (14,000+ stars) is a plugin-driven metric collection agent that includes a powerful procstat input plugin for process monitoring. Unlike dedicated exporters, Telegraf collects process metrics alongside system, network, and application metrics in a single agent.
Features
- Unified collection — process metrics combined with 300+ other input plugins
- Pattern matching — filter processes by name, executable, command line, or user
- Extensive metrics — CPU percentage, memory usage, file descriptors, threads, IO bytes, context switches
- Multiple outputs — send metrics to InfluxDB, Prometheus, Kafka, Elasticsearch, and more
- Cross-platform — Linux, Windows, macOS support
Docker Deployment
| |
Telegraf configuration (telegraf.conf):
| |
Process Metric Collection
The procstat plugin collects:
- procstat_lookup: number of matching processes
- procstat_cpu_usage: CPU utilization percentage per process
- procstat_memory_rss: resident set size
- procstat_memory_vms: virtual memory size
- procstat_num_fds: open file descriptor count
- procstat_num_threads: thread count
- procstat_read_bytes / procstat_write_bytes: disk IO per process
- procstat_voluntary_context_switches / procstat_involuntary_context_switches: scheduling metrics
Feature Comparison
| Feature | process-exporter | Node Problem Detector | Telegraf procstat |
|---|---|---|---|
| Primary role | Process metrics exporter | Node health detector | Unified metric agent |
| Metric format | Prometheus | Kubernetes events + Prometheus | Multiple (InfluxDB, Prometheus, etc.) |
| Process grouping | Yes (configurable patterns) | No (individual processes) | Yes (pattern matching) |
| CPU metrics | Yes (cumulative + rate) | Limited (node-level) | Yes (percentage-based) |
| Memory metrics | Yes (RSS, VMS, PSS) | Node-level only | Yes (RSS, VMS) |
| File descriptors | Yes | No | Yes |
| IO metrics | Yes (read/write bytes) | No | Yes (read/write bytes) |
| Kubernetes integration | Manual (scrape config) | Native (DaemonSet, node conditions) | Manual (sidecar or host agent) |
| Alerting | Via Prometheus rules | Via Kubernetes events | Via output plugins |
| Docker image | Docker Hub | registry.k8s.io | Docker Hub |
| Stars | 2,117+ | 3,408+ | 14,000+ |
| Best for | Prometheus-centric monitoring | Kubernetes cluster health | Multi-output metric collection |
Choosing the Right Solution
Use process-exporter when you run Prometheus and need dedicated, high-granularity process metrics. Its configurable grouping lets you aggregate metrics by application (all Java processes, all Postgres workers) rather than tracking individual PIDs. Ideal for teams already invested in the Prometheus/Grafana ecosystem.
Use Node Problem Detector when you manage a Kubernetes cluster and need automated node health detection. It reports problems as node conditions and events through the Kubernetes API; acting on them (for example, cordoning or draining an unhealthy node) requires a separate remedy controller. Best for platform teams running production Kubernetes workloads who need proactive node issue detection.
Use Telegraf when you need process monitoring alongside other metric collection (system, network, application). Its 300+ input plugins make it the most versatile option, especially if your observability stack uses InfluxDB or you need to send metrics to multiple destinations simultaneously.
FAQ
Can I run process-exporter without Docker?
Yes. Download the binary from GitHub releases and run it directly: ./process-exporter --procfs /proc --config.path config.yml. The Docker approach is recommended for easier updates and isolation, but the binary works on any Linux system.
Does Node Problem Detector replace Prometheus monitoring?
No. Node Problem Detector focuses on node-level health conditions (kernel issues, disk errors, OOM kills) and reports them as Kubernetes events and node conditions. It does not provide the detailed time-series metrics that Prometheus exporters offer. Many teams run both: NPD for node health and process-exporter for application-level metrics.
How do I monitor specific processes with Telegraf?
Use the pattern field in the procstat input configuration to match process names or command lines. For example, pattern = "java" matches all Java processes. You can also use exe for exact executable names, user to filter by process owner, or pid_file to track processes by their PID file.
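One thing worth knowing about pattern-style matching is that an unanchored regular expression matches anywhere in the command line. The Python sketch below imitates that behavior to show the pitfall. It is not Telegraf's implementation, and the `Proc` type, the `match_pids` function, and the sample command lines are all made up for illustration.

```python
import re
from typing import Iterable, List, NamedTuple


class Proc(NamedTuple):
    pid: int
    cmdline: str


def match_pids(procs: Iterable[Proc], pattern: str) -> List[int]:
    """Return PIDs whose full command line contains a match for the regex."""
    regex = re.compile(pattern)
    return [p.pid for p in procs if regex.search(p.cmdline)]


# Hypothetical process table used only for this example.
SAMPLE = [
    Proc(10, "/usr/bin/java -jar api.jar"),
    Proc(11, "/usr/sbin/nginx -g daemon off;"),
    Proc(12, "python3 javascript_build.py"),
]
```

With this sample, the pattern `java` selects both PID 10 and PID 12, because "javascript" also contains "java". Adding word boundaries, as in `\bjava\b`, narrows the match to PID 10.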
Can process-exporter monitor Windows processes?
No. process-exporter reads the Linux /proc filesystem and is Linux-only. For Windows process monitoring, use Telegraf’s procstat plugin, which supports both Linux and Windows.
How often should I scrape process metrics?
For most use cases, a 10-30 second scrape interval is sufficient. Process-exporter and Telegraf both have minimal overhead when reading /proc. Node Problem Detector runs checks every 30 seconds by default. Avoid sub-5-second intervals as they can cause measurable CPU overhead on systems with many processes.
What happens if a critical process crashes?
With process-exporter, the group's process count drops to zero, so you can alert with a Prometheus rule such as namedprocess_namegroup_num_procs{groupname="nginx"} == 0. Node Problem Detector generates a Kubernetes event and can set a node condition. With Telegraf, the pid_count field of the procstat_lookup measurement drops to zero; when metrics are exposed through the Prometheus output, that is typically the expression procstat_lookup_pid_count < 1.
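The "count drops to zero" check can be sketched in a few lines of Python: count processes per named group, then list the groups with no members. The group names, regexes, and function names here are hypothetical. The sketch reports every configured group, including ones that never matched anything; whether a real exporter emits a series for a group it has never seen is worth verifying before relying on an `== 0` alert alone.

```python
import re
from typing import Dict, Iterable, List


def count_by_group(comms: Iterable[str], groups: Dict[str, str]) -> Dict[str, int]:
    """Count processes per group; a process joins the first group whose regex fully matches."""
    compiled = [(name, re.compile(rx)) for name, rx in groups.items()]
    counts = {name: 0 for name in groups}
    for comm in comms:
        for name, rx in compiled:
            if rx.fullmatch(comm):
                counts[name] += 1
                break  # first matching group wins
    return counts


def down_groups(counts: Dict[str, int]) -> List[str]:
    """Names of groups that currently have no running processes."""
    return sorted(name for name, n in counts.items() if n == 0)
```

Feeding it the process names nginx, nginx, postgres, and sshd with groups for nginx, postgres, and redis yields counts of 2, 1, and 0, so only the redis group would trigger an alert.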