← Back to posts
comparison guide self-hosted · · 9 min read

Best Self-Hosted GPU Monitoring Tools 2026: nvtop vs DCGM Exporter vs Netdata

Compare the best open-source GPU monitoring tools for self-hosted servers in 2026. Detailed guide covering nvtop, NVIDIA DCGM Exporter, and Netdata with Docker Compose configs and Grafana dashboards.

OS
Editorial Team

If you run a self-hosted server with a GPU — whether for Jellyfin hardware transcoding, machine learning workloads, game servers, or cryptocurrency mining — you need visibility into GPU utilization, memory usage, temperature, and power draw. Without proper monitoring, a runaway process can silently consume all VRAM, thermal throttling can cripple performance, and hardware failures can go unnoticed until it’s too late.

In this guide, we compare three of the most popular open-source GPU monitoring tools for self-hosted servers: nvtop (terminal-based GPU process monitor), NVIDIA DCGM Exporter (Prometheus metrics exporter), and Netdata (real-time system observability with built-in GPU plugins).

Why Monitor Your GPU?

GPUs are among the most expensive components in a server, and they’re also the most sensitive to misconfiguration. Here’s why GPU monitoring matters:

  • Thermal management: Sustained high temperatures reduce GPU lifespan and trigger throttling
  • Memory leaks: ML frameworks can silently allocate VRAM without releasing it
  • Resource contention: Multiple services (transcoding + inference + gaming) compete for GPU time
  • Power costs: Unoptimized GPU workloads consume significant electricity
  • Capacity planning: Historical metrics tell you when it’s time to upgrade

Unlike CPU monitoring, which has decades of mature tooling, GPU monitoring requires specialized tools that understand accelerator architectures, memory pools, and vendor-specific metrics.

At a Glance: Comparison Table

FeaturenvtopDCGM ExporterNetdata
Primary useTerminal htop for GPUsPrometheus metrics exporterReal-time system dashboard
GPU vendorsAMD, Intel, NVIDIA, Apple, HuaweiNVIDIA onlyNVIDIA, AMD (limited)
InterfaceTerminal (ncurses)Prometheus + GrafanaWeb dashboard
Process trackingPer-process GPU usageNo process-level detailBasic process info
AlertingNo (visual only)Via Prometheus AlertmanagerBuilt-in alerts
Historical dataNoVia Prometheus TSDBBuilt-in storage
Docker supportHost access requiredNative Docker containerDocker agent or cloud
GitHub stars10,5121,69078,554
Best forQuick diagnosticsProduction monitoringAll-in-one observability

nvtop — htop for GPUs

nvtop is a terminal-based GPU monitoring tool that works like htop or top, but for graphics cards. It shows real-time utilization, memory consumption, temperature, and a list of processes using the GPU — all in a single terminal window.

Key Features

  • Multi-vendor support: Works with NVIDIA, AMD, Intel, Apple Silicon, Huawei Ascend, and Qualcomm Adreno GPUs
  • Process monitoring: See exactly which processes are consuming GPU resources, similar to nvidia-smi but with a live-updating interface
  • Graph history: Rolling utilization and memory graphs within the terminal
  • Color-coded bars: Visual indication of compute vs. memory vs. encode/decode utilization
  • Zero dependencies: No database, no web server, no configuration files

Installation

On Ubuntu/Debian:

1
2
sudo apt update
sudo apt install nvtop

On Arch Linux:

1
sudo pacman -S nvtop

On Fedora:

1
sudo dnf install nvtop

You can also build from source for the latest version:

1
2
3
4
5
6
git clone https://github.com/Syllo/nvtop.git
cd nvtop
mkdir -p build && cd build
cmake .. -DNVIDIA_SUPPORT=ON -DAMDGPU_SUPPORT=ON -DINTEL_SUPPORT=ON
make -j$(nproc)
sudo make install

Usage

Simply run:

1
nvtop

For GPU selection (if you have multiple GPUs):

1
2
nvtop -d 0    # Monitor GPU 0 only
nvtop -g      # Show GPU selection menu

When to Use nvtop

nvtop excels as a diagnostic tool. SSH into your server, type nvtop, and immediately see what’s eating your GPU. It’s perfect for:

  • Debugging why a GPU is at 100% utilization
  • Checking if a container is properly accessing the GPU
  • Quick thermal checks before starting a heavy workload
  • Monitoring GPU memory leaks in development

However, nvtop does not provide historical data, alerting, or remote monitoring. You need to be at the terminal to see it.

NVIDIA DCGM Exporter — Production-Grade GPU Metrics

The NVIDIA Data Center GPU Manager (DCGM) Exporter is the official Prometheus exporter for NVIDIA GPU telemetry. It leverages the DCGM library to collect deep hardware metrics and exposes them in Prometheus format for consumption by Grafana or other visualization tools.

Key Features

  • Deep NVIDIA metrics: Temperature, power, clock speeds, memory bandwidth, ECC errors, PCIe throughput, NVLink utilization
  • Prometheus-native: Exposes metrics on :9400/metrics for immediate scraping
  • Grafana dashboards: Community dashboards (ID 12239, 14583) provide instant visualization
  • Containerized: Runs as a Docker container with GPU passthrough
  • Alerting-ready: Integrates with Prometheus Alertmanager for threshold-based notifications

Docker Compose Configuration

Here’s a production-ready Docker Compose setup with DCGM Exporter, Prometheus, and Grafana:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
version: "3.8"

services:
  dcgm-exporter:
    image: nvcr.io/nvidia/k8s/dcgm-exporter:3.3.7-3.5.0-ubuntu22.04
    container_name: dcgm-exporter
    runtime: nvidia
    cap_add:
      - SYS_ADMIN
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - DCGM_EXPORTER_LISTEN=:9400
    ports:
      - "9400:9400"
    restart: unless-stopped

  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus-gpu
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - prometheus-data:/prometheus
    ports:
      - "9090:9090"
    restart: unless-stopped

  grafana:
    image: grafana/grafana:latest
    container_name: grafana-gpu
    volumes:
      - grafana-data:/var/lib/grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    restart: unless-stopped

volumes:
  prometheus-data:
  grafana-data:

Create prometheus.yml in the same directory:

1
2
3
4
5
6
7
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: "dcgm-exporter"
    static_configs:
      - targets: ["dcgm-exporter:9400"]

Start the stack:

1
docker compose up -d

Key Metrics Exposed

MetricDescription
DCGM_FI_DEV_GPU_UTILGPU utilization percentage (0-100)
DCGM_FI_DEV_MEM_COPY_UTILMemory copy/utilization percentage
DCGM_FI_DEV_FB_USEDFramebuffer (VRAM) used in MiB
DCGM_FI_DEV_FB_FREEFramebuffer free in MiB
DCGM_FI_DEV_GPU_TEMPGPU temperature in Celsius
DCGM_FI_DEV_POWER_USAGEPower draw in Watts
DCGM_FI_DEV_SM_CLOCKStreaming Multiprocessor clock MHz
DCGM_FI_DEV_ECC_DBE_VOL_TOTALVolatile double-bit ECC errors

Grafana Dashboard Setup

After starting Grafana, import the community NVIDIA DCGM dashboard:

  1. Navigate to http://localhost:3000
  2. Go to Dashboards → Import
  3. Enter dashboard ID 12239 (NVIDIA DCGM Exporter)
  4. Select your Prometheus data source
  5. Click Import

When to Use DCGM Exporter

DCGM Exporter is the right choice when you need production-grade, long-term GPU monitoring with alerting. It’s ideal for:

  • ML training clusters where you need historical utilization data
  • Multi-GPU servers where capacity planning matters
  • Teams already running Prometheus/Grafana infrastructure
  • Alerting on GPU temperature, ECC errors, or power anomalies

The main limitation is that it’s NVIDIA-only. AMD GPU users will need alternative tools.

Netdata — All-in-One Observability with GPU Monitoring

Netdata is a comprehensive real-time observability platform that monitors every aspect of a system — CPU, memory, disk, network, containers, and GPUs. Its GPU monitoring plugin automatically detects NVIDIA GPUs (via nvidia-smi) and AMD GPUs (via rocm-smi) and streams metrics to its web dashboard.

Key Features

  • Zero configuration: Auto-detects GPUs and starts collecting metrics immediately
  • Real-time web dashboard: 1-second resolution graphs accessible from any browser
  • Built-in alerting: Pre-configured alerts for GPU temperature, utilization, and memory
  • Multi-node support: Netdata Cloud aggregates metrics across multiple servers
  • Broad vendor support: NVIDIA via nvidia-smi, AMD via ROCm, Intel via Level Zero
  • Historical data: Built-in database with configurable retention

Docker Compose Configuration

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
version: "3.8"

services:
  netdata:
    image: netdata/netdata:latest
    container_name: netdata
    hostname: gpu-server
    cap_add:
      - SYS_PTRACE
      - SYS_ADMIN
    security_opt:
      - apparmor:unconfined
    volumes:
      - netdata-config:/etc/netdata
      - netdata-lib:/var/lib/netdata
      - netdata-cache:/var/cache/netdata
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /etc/os-release:/host/etc/os-release:ro
      - /var/run/docker.sock:/var/run/docker.sock:ro
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
    ports:
      - "19999:19999"
    restart: unless-stopped
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

Start with:

1
docker compose up -d

Access the dashboard at http://your-server:19999. GPU metrics appear automatically under the NVIDIA GPU or AMD GPU section.

GPU Alert Configuration

Netdata includes pre-built GPU alerts. To customize thresholds, edit /etc/netdata/health.d/nvidia.conf:

1
2
3
4
5
6
7
8
template: nvidia_gpu_temp
      on: nvidia_gpu.temperature
   lookup: average -1m percentage
    every: 1m
     warn: $this > (($status >= $WARNING)  ? (80) : (75))
    crit: $this > (($status == $CRITICAL) ? (90) : (85))
     info: GPU temperature exceeds safe thresholds
       to: sysadmin

When to Use Netdata

Netdata is the best choice when you want a single monitoring tool for everything — not just GPUs. It’s ideal for:

  • Small-to-medium teams that don’t want to manage separate monitoring stacks
  • Servers where GPU is one component among many to monitor
  • Quick deployment with zero configuration
  • Teams that need built-in alerting without setting up Prometheus + Alertmanager

The trade-off is that Netdata’s GPU metrics are less deep than DCGM Exporter’s. You get temperature, utilization, and memory, but not PCIe throughput, NVLink stats, or ECC error counts.

Choosing the Right Tool

ScenarioRecommended Tool
Quick SSH diagnostic checknvtop
NVIDIA-only production monitoring with GrafanaDCGM Exporter
All-in-one server monitoring with GPU supportNetdata
Multi-vendor GPU environment (NVIDIA + AMD)nvtop or Netdata
ML training cluster with historical data needsDCGM Exporter + Prometheus
Jellyfin/Plex transcoding serverNetdata (simplest setup)
Alerting on GPU temperature/powerDCGM Exporter or Netdata

Combining Tools

Many server administrators use multiple tools together:

  • nvtop for quick SSH diagnostics
  • DCGM Exporter + Prometheus + Grafana for production dashboards and alerting
  • Netdata for overall system health monitoring

This gives you both the immediate terminal visibility and the long-term historical data needed for capacity planning and troubleshooting.

For related reading, see our Prometheus vs Grafana vs VictoriaMetrics comparison and the Grafana Pyroscope continuous profiling guide. If you’re running ML workloads, check our MLflow vs ClearML vs Aim experiment tracking guide.

FAQ

What is the best free GPU monitoring tool for Linux servers?

For terminal-based monitoring, nvtop is the best free option — it supports NVIDIA, AMD, Intel, and even Apple Silicon GPUs with a single command. For web-based monitoring with historical data, Netdata provides the most comprehensive free solution with zero configuration. For production NVIDIA environments, DCGM Exporter combined with Prometheus and Grafana offers the deepest metrics.

Can nvtop monitor AMD GPUs?

Yes. nvtop supports AMD GPUs via the amdgpu kernel driver. Install nvtop with AMD support enabled (most package managers include it by default) and it will automatically detect and display AMD GPU metrics including compute utilization, VRAM usage, and temperature.

Does DCGM Exporter work with consumer NVIDIA GPUs?

Yes, DCGM Exporter works with both data center GPUs (A100, H100, V100) and consumer GPUs (RTX 3090, RTX 4090, RTX 4080). However, some advanced metrics like NVLink and ECC error reporting may not be available on consumer cards. Core metrics like utilization, temperature, memory, and power draw work across all NVIDIA GPUs.

How do I set up GPU temperature alerts?

With Netdata, GPU temperature alerts are enabled by default. With DCGM Exporter + Prometheus, create an alerting rule in your prometheus.yml:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
groups:
  - name: gpu_alerts
    rules:
      - alert: HighGPUTemperature
        expr: DCGM_FI_DEV_GPU_TEMP > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "GPU {{ $labels.gpu }} temperature above 85°C"

Can I monitor GPU usage inside Docker containers?

Yes, but it requires GPU passthrough. For nvtop, you need to run it on the host machine (not inside a container) to see all processes. For DCGM Exporter, the Docker container needs runtime: nvidia and cap_add: [SYS_ADMIN]. For Netdata, mount the NVIDIA driver libraries and /proc into the container as shown in the Docker Compose configuration above.

What GPU metrics should I monitor for ML training workloads?

For ML training, focus on: GPU utilization (should be near 100% during training), VRAM usage (watch for OOM errors), temperature (thermal throttling reduces training speed), and power draw (sustained max power indicates efficient GPU usage). DCGM Exporter provides all of these plus memory bandwidth and clock speed metrics that help identify bottlenecks in your training pipeline.

Advertise here
Advertise here