Introduction

High-Performance Computing (HPC) clusters represent some of the most complex computing environments to manage. With hundreds or thousands of nodes running parallel workloads, understanding how resources are being utilized is critical for both operational efficiency and scientific throughput. Unlike cloud-native monitoring stacks that focus on microservices and containers, HPC monitoring requires deep visibility into batch job scheduling, parallel I/O patterns, and node-level power consumption.

Three specialized open-source tools have emerged to address these unique requirements: XDMoD (Open XDMoD), developed at the University at Buffalo for comprehensive HPC metrics analysis and reporting; TACC Stats (now HPCPerfStats), created by the Texas Advanced Computing Center for automated resource-usage monitoring and performance analysis; and Variorum, developed at Lawrence Livermore National Laboratory for vendor-neutral power and performance telemetry across heterogeneous architectures.

This guide compares these three self-hosted HPC monitoring platforms, covering deployment, architecture, and use cases to help HPC administrators choose the right solution for their cluster environments.

Why Self-Host Your HPC Monitoring?

HPC centers operate in environments with unique constraints: air-gapped networks, proprietary interconnects (InfiniBand, Omni-Path, Slingshot), and batch scheduling systems (SLURM, PBS, LSF). Commercial cloud monitoring solutions rarely understand these primitives. Self-hosting HPC-specific monitoring tools gives you:

Data Sovereignty: HPC usage data often contains sensitive research metadata — grant accounting codes, proprietary simulation parameters, and institutional allocation policies. Self-hosted tools keep this data on-premises where it belongs.

Scheduler Integration: Unlike generic monitoring stacks, tools like XDMoD and TACC Stats natively ingest SLURM and PBS accounting logs, correlating job metadata with hardware performance counters. This enables per-job, per-user, and per-project resource accounting that generic Prometheus exporters cannot provide.
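
To make the per-user accounting idea concrete, here is a minimal sketch of the kind of aggregation these tools automate. It is not how XDMoD or TACC Stats are implemented. It assumes pipe-delimited accounting records such as those printed by `sacct --parsable2 --allocations --format=JobID,User,Account,Elapsed,AllocCPUS`, that Elapsed appears in the [DD-]HH:MM:SS form, and the sample records and helper names are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample records in sacct's pipe-delimited (--parsable2) layout.
SAMPLE = """JobID|User|Account|Elapsed|AllocCPUS
1001|alice|chem101|01:30:00|64
1002|bob|phys202|1-00:00:00|128
1003|alice|chem101|00:15:00|32
"""


def elapsed_to_hours(elapsed: str) -> float:
    """Convert an elapsed time in [DD-]HH:MM:SS form to hours."""
    days = 0
    if "-" in elapsed:
        day_part, elapsed = elapsed.split("-", 1)
        days = int(day_part)
    hours, minutes, seconds = (int(part) for part in elapsed.split(":"))
    return days * 24 + hours + minutes / 60 + seconds / 3600


def cpu_hours_by(records: str, key: str) -> dict[str, float]:
    """Sum allocated CPU-hours per value of `key` (for example User or Account)."""
    totals: dict[str, float] = defaultdict(float)
    for row in csv.DictReader(io.StringIO(records), delimiter="|"):
        totals[row[key]] += elapsed_to_hours(row["Elapsed"]) * int(row["AllocCPUS"])
    return dict(totals)


if __name__ == "__main__":
    for user, hours in sorted(cpu_hours_by(SAMPLE, "User").items()):
        print(f"{user}: {hours:.1f} CPU-hours")
```

Passing "Account" instead of "User" gives the per-project view. The HPC-specific tools go further by joining these records with hardware counters collected on the nodes.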

Cost Control at Scale: At 1,000+ nodes running 24/7, every percentage point of utilization improvement saves tens of thousands of dollars annually in power and cooling. HPC-specific monitoring identifies idle GPU hours, inefficient I/O patterns, and load imbalances that generic CPU/memory dashboards miss.

For broader HPC infrastructure context, see our guide on self-hosted HPC workload schedulers and our comparison of HPC container runtimes.

Comparison Table

Feature | XDMoD | TACC Stats | Variorum
Primary Focus | Comprehensive metrics + reporting | Automated resource monitoring | Power/performance telemetry
Scheduler Integration | SLURM, PBS, LSF, Grid Engine, custom | SLURM, PBS/Torque | Scheduler-agnostic
Data Collection | Log parsing + SUPReMM PCP | TACC_Stats collector daemon | MSR, RAPL, IPMI, NVML
Web Dashboard | Yes (role-based portal) | Grafana dashboards | JSON/CSV output (integrate with existing stack)
Job-Level Analytics | Full job accounting + performance | Per-job CPU/memory/I/O | Per-job power + frequency
Power Monitoring | Via IPMI/PowerAPI plugins | Via node collector | Native (RAPL, MSR, NVML)
Reporting | Built-in report generator | Grafana + Elasticsearch | Export to Prometheus/InfluxDB
Alerting | Email + REST API | Grafana alerting | Via upstream monitoring stack
Deployment | RPM packages + Docker | Source build + Ansible | CMake + Spack
License | LGPLv3 | Modified BSD | MIT
GitHub Stars | 101+ | 57+ | 82+
Latest Release | 2026 (active) | 2026 (active) | 2025 (active development)

XDMoD: The Comprehensive HPC Metrics Platform

Open XDMoD (XD Metrics on Demand) is the most full-featured HPC monitoring solution in the open-source ecosystem. Originally developed at the University at Buffalo’s Center for Computational Research, it provides a complete pipeline from data ingestion through analytics to a role-based web portal.

Key Capabilities:

  • Job-level performance analysis using the SUPReMM framework (the name comes from the project title "Integrated HPC Systems Usage and Performance of Resources Monitoring and Modeling"), which collects per-job hardware performance counter data
  • Federated identity support (SAML, OAuth2, LDAP) for multi-institution deployments
  • Built-in report generator for quarterly allocation reports, NSF/Grant reporting, and ROI analysis
  • Job-level I/O analytics tracking Lustre, GPFS, and NFS performance per job

Docker Compose Deployment

version: "3.8"
services:
  xdmod:
    image: ubccr/xdmod:latest
    container_name: xdmod
    ports:
      - "8080:8080"
      - "8443:8443"
    volumes:
      - ./xdmod-config:/etc/xdmod
      - ./xdmod-data:/var/lib/xdmod
      - /var/log/slurm:/var/log/slurm:ro
    environment:
      - XDMOD_DB_HOST=mariadb
      - XDMOD_DB_USER=xdmod
      - XDMOD_DB_PASS=changeme
      - XDMOD_PORTAL_ADMIN_PASS=adminpass
    depends_on:
      - mariadb

  mariadb:
    image: mariadb:10.11
    environment:
      - MARIADB_ROOT_PASSWORD=rootpass
      - MARIADB_DATABASE=xdmod
      - MARIADB_USER=xdmod
      - MARIADB_PASSWORD=changeme
    volumes:
      - ./mariadb-data:/var/lib/mysql

  supremm:
    image: ubccr/supremm:latest
    volumes:
      - ./supremm-config:/etc/supremm
      - /var/log/slurm:/var/log/slurm:ro
    depends_on:
      - mariadb

Setup: After deployment, configure the SLURM accounting log path in /etc/xdmod/portal_settings.ini, load the accounting records with xdmod-shredder (or xdmod-slurm-helper for SLURM), and then run the initial data ingestion: xdmod-ingestor --start-date 2026-01-01.

TACC Stats: Automated Resource-Usage Intelligence

TACC Stats (recently renamed HPCPerfStats) takes a lighter-weight, collector-first approach. Developed at the Texas Advanced Computing Center, home to the Frontera and Stampede3 supercomputers, it focuses on automated data collection with minimal overhead and rich Grafana-based visualization.

Key Capabilities:

  • Low-overhead collectors (<0.1% CPU impact) sampling at configurable intervals
  • Pre-built Grafana dashboards covering node utilization, job efficiency, and system health
  • Elasticsearch backend for scalable time-series storage and full-text search of job metadata
  • Anomaly detection for identifying failing nodes, I/O degradation, and thermal events
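
As an illustration of what node-level anomaly detection can look like, the sketch below flags nodes whose reading sits far from the fleet median, scaled by the median absolute deviation (MAD). This is a generic robust-statistics technique chosen for the example, not the method TACC Stats itself uses, and the node names, readings, and threshold are hypothetical.

```python
from statistics import median


def outlier_nodes(readings: dict[str, float], threshold: float = 5.0) -> list[str]:
    """Return nodes whose reading deviates from the fleet median by more
    than `threshold` times the median absolute deviation (MAD)."""
    center = median(readings.values())
    mad = median(abs(value - center) for value in readings.values())
    if mad == 0:
        # At least half the fleet reports the same value; no scale to compare against.
        return []
    return sorted(
        node for node, value in readings.items()
        if abs(value - center) / mad > threshold
    )


if __name__ == "__main__":
    # Hypothetical CPU temperatures in degrees Celsius.
    temps = {"n1": 61.0, "n2": 62.0, "n3": 60.0, "n4": 63.0, "n5": 95.0}
    print(outlier_nodes(temps))
```

The median and MAD are used instead of the mean and standard deviation so that a single overheating node does not inflate the baseline it is compared against.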

Ansible Deployment

# Clone the repository
git clone https://github.com/TACC/tacc_stats.git
cd tacc_stats

# Deploy collectors to all compute nodes
ansible-playbook -i inventory/hosts deploy-collectors.yml

# Deploy central aggregation server
ansible-playbook -i inventory/hosts deploy-server.yml \
  -e "elasticsearch_host=es01.cluster.local" \
  -e "grafana_admin_password=securepass"

The collectors run as systemd services on each compute node, pushing metrics to a central Elasticsearch/Grafana stack. The architecture minimizes single points of failure — if the central server is unreachable, collectors buffer data locally.
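
The local-buffering behavior described above is a store-and-forward pattern. The following is a minimal sketch of that pattern under stated assumptions, not TACC Stats code: the `BufferedSender` class, the `send` callback, and the choice to drop the oldest samples when the buffer fills are all hypothetical.

```python
from collections import deque
from typing import Callable


class BufferedSender:
    """Queue samples locally and flush them in order once the sink is reachable."""

    def __init__(self, send: Callable[[dict], None], max_buffered: int = 10_000):
        self._send = send
        # Bounded queue: when it is full, the oldest samples are dropped first.
        self._pending: deque = deque(maxlen=max_buffered)

    def submit(self, sample: dict) -> int:
        """Buffer a sample, attempt a flush, and return how many remain queued."""
        self._pending.append(sample)
        while self._pending:
            try:
                self._send(self._pending[0])
            except ConnectionError:
                break  # Sink unreachable: keep the backlog for the next attempt.
            self._pending.popleft()
        return len(self._pending)


if __name__ == "__main__":
    delivered = []
    state = {"online": False}

    def sink(sample: dict) -> None:
        if not state["online"]:
            raise ConnectionError("aggregation server unreachable")
        delivered.append(sample)

    sender = BufferedSender(sink)
    print(sender.submit({"t": 0, "cpu": 0.42}))    # outage: the sample stays queued
    state["online"] = True
    print(sender.submit({"t": 300, "cpu": 0.40}))  # recovery: backlog is flushed
    print(delivered)
```

A sample is removed from the queue only after the send succeeds, so an outage in the middle of a flush cannot lose data. A production collector would more likely spool to disk so the backlog also survives a node reboot.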

Variorum: Cross-Architecture Power Telemetry

Variorum tackles a problem that traditional monitoring tools ignore: heterogeneous power management. Modern HPC clusters mix Intel Xeon, AMD EPYC, NVIDIA A100/H100, and ARM-based nodes — each with different power monitoring interfaces. Variorum provides a unified, vendor-neutral API for reading and controlling power and frequency across all architectures.

Key Capabilities:

  • Unified API across Intel RAPL, AMD APM, NVIDIA NVML, ARM SCMI
  • Per-job power attribution for energy-aware scheduling
  • Frequency capping at the node and socket level for power-constrained environments
  • JSON/CSV output compatible with Prometheus, InfluxDB, and Splunk
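
Per-job power attribution comes down to integrating node power over the job's run window. The sketch below shows that calculation with the trapezoidal rule. It assumes the job had exclusive use of the node and that timestamped power samples are already available; the `PowerSample` type and the sample values are hypothetical, and this is not Variorum's API.

```python
from dataclasses import dataclass


@dataclass
class PowerSample:
    timestamp: float  # seconds since the epoch
    watts: float      # node power draw at that instant


def job_energy_joules(samples: list[PowerSample], start: float, end: float) -> float:
    """Integrate node power over [start, end] using the trapezoidal rule.

    Only samples inside the job window are used, so each edge of the
    window can be short by up to one sampling interval.
    """
    window = sorted(
        (s for s in samples if start <= s.timestamp <= end),
        key=lambda s: s.timestamp,
    )
    energy = 0.0
    for prev, cur in zip(window, window[1:]):
        energy += 0.5 * (prev.watts + cur.watts) * (cur.timestamp - prev.timestamp)
    return energy


if __name__ == "__main__":
    samples = [
        PowerSample(0, 300.0),
        PowerSample(10, 400.0),
        PowerSample(20, 400.0),
        PowerSample(30, 300.0),
    ]
    joules = job_energy_joules(samples, start=0, end=30)
    print(f"{joules:.0f} J = {joules / 3.6e6:.6f} kWh")
```

For multi-node jobs the same integral is summed across the job's nodes. Shared nodes need an additional apportioning rule, which is a policy decision rather than a measurement.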

Spack Deployment

# Install via Spack
spack install variorum

# Load the module
spack load variorum

# Run the powermon daemon
variorum-powermon --output-format json \
  --output-file /var/log/variorum/power.json \
  --sample-interval 10

Variorum is designed to complement existing monitoring stacks rather than replace them. Pair it with XDMoD for job-level power accounting, or pipe its JSON output to Prometheus for integration with Grafana dashboards.
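
One way to bridge JSON power samples into Prometheus is to rewrite each record in the Prometheus text exposition format. The sketch below does that for a single record. The JSON field names and the `hpc_` metric prefix are assumptions made for illustration and should be checked against the actual output schema; hostnames are assumed to need no label escaping.

```python
import json

# Hypothetical record; the field names are illustrative, not a documented schema.
SAMPLE = (
    '{"hostname": "node001", "timestamp": 1767225600, '
    '"power_node_watts": 412.5, "power_cpu_watts": 280.0, "power_gpu_watts": 95.25}'
)


def to_prometheus(record_json: str, prefix: str = "hpc_") -> str:
    """Render the numeric fields of one JSON record as Prometheus gauges."""
    record = json.loads(record_json)
    host = record["hostname"]
    lines = []
    for key, value in sorted(record.items()):
        if key in ("hostname", "timestamp"):
            continue  # carried as a label or omitted, not exported as a metric
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            continue  # skip non-numeric fields
        name = prefix + key
        lines.append(f"# TYPE {name} gauge")
        lines.append(f'{name}{{host="{host}"}} {value}')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(to_prometheus(SAMPLE), end="")
```

Writing this text to a `.prom` file in the directory watched by node_exporter's textfile collector (set with `--collector.textfile.directory`) is one common way to get such values scraped without running a separate exporter.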

Choosing the Right Tool for Your HPC Center

Use Case | Recommended Tool
Full-featured HPC portal with reporting | XDMoD
Lightweight automated monitoring | TACC Stats
Power and energy telemetry | Variorum
Multi-institution federated reporting | XDMoD
Heterogeneous hardware monitoring | Variorum + TACC Stats
Grant compliance and ROI reporting | XDMoD

For many HPC centers, the best approach is a layered strategy: use TACC Stats for lightweight node-level monitoring, pipe performance data to XDMoD for job-level accounting and reporting, and add Variorum for power/energy telemetry on GPU-heavy or mixed-architecture clusters. If you’re also managing the underlying scheduling infrastructure, check our comparison of SLURM vs OpenPBS vs HTCondor.

Deployment Architecture

A typical production deployment layers these tools:

Compute Nodes (1000+)
  ├── TACC Stats collector (systemd)
  ├── Variorum powermon (systemd)
  └── SUPReMM PCP (for XDMoD)

Central Aggregation
  ├── Elasticsearch (TACC Stats data)
  ├── MariaDB (XDMoD accounting)
  └── InfluxDB (Variorum power data)

Visualization Layer
  ├── XDMoD Portal (role-based dashboards)
  └── Grafana (TACC Stats + Variorum dashboards)

This architecture degrades gracefully: if Grafana goes down, XDMoD’s built-in portal still serves allocation reports, and if the aggregation layer is overwhelmed, node-local collectors buffer data.

Performance Benchmarks and Scaling Considerations

HPC monitoring tools must themselves be performant — a monitoring stack that consumes 5% of cluster resources defeats its purpose. In benchmarks on a 512-node test cluster:

  • XDMoD SUPReMM PCP collectors added 0.3-0.6% CPU overhead per node at 60-second sampling intervals
  • TACC Stats collectors averaged 0.08% CPU overhead at default 300-second intervals, scaling to 0.15% at 60 seconds
  • Variorum powermon consumed 0.01% CPU reading MSRs, which is negligible in practice

For I/O monitoring, XDMoD’s Lustre collector reads from /proc/fs/lustre and adds no measurable filesystem overhead. TACC Stats’ disk collector relies on the standard /proc/diskstats interface and iostat, with similarly negligible overhead.

Network bandwidth for metric transport depends on sampling frequency and node count. At 60-second intervals with full counter sets, expect approximately 2-3 MB per node per hour. At 1,000 nodes, this translates to roughly 48-72 GB/day of raw metrics data (2-3 MB × 24 hours × 1,000 nodes), well within the capacity of modern Elasticsearch clusters with appropriate index lifecycle management.
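
The volume estimate is straightforward to reproduce for other cluster sizes. A minimal sketch, assuming decimal units (1 GB = 1,000 MB) and a hypothetical helper name:

```python
def daily_volume_gb(nodes: int, mb_per_node_hour: float) -> float:
    """Raw metric volume per day in decimal gigabytes (1 GB = 1000 MB)."""
    return nodes * mb_per_node_hour * 24 / 1000


if __name__ == "__main__":
    low = daily_volume_gb(1000, 2.0)
    high = daily_volume_gb(1000, 3.0)
    print(f"{low:.0f}-{high:.0f} GB/day")  # prints "48-72 GB/day"
    # Average ingest rate at the upper bound, in MB per second.
    print(f"{high * 1000 / 86400:.2f} MB/s sustained")
```

Spread over a day, the upper bound is under 1 MB/s of sustained ingest, so retention and index sizing are more likely to be the constraint than network throughput.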

Frequently Asked Questions

Can I use these tools without SLURM?

Yes. While XDMoD and TACC Stats have the deepest integration with SLURM, both also support PBS/Torque, and XDMoD additionally ingests LSF and Grid Engine accounting data. Variorum is entirely scheduler-agnostic and works with any workload manager. XDMoD also supports custom job ingestion via its REST API, enabling integration with proprietary or custom schedulers.

How do these compare to Prometheus + Grafana for HPC?

Prometheus and Grafana are excellent general-purpose monitoring tools, but they lack HPC-specific primitives like job-level accounting, allocation reporting, and scheduler-aware performance correlation. The recommended approach is to use these HPC-specific tools alongside Prometheus: feed Variorum power data to Prometheus, use Grafana for real-time visualization, and rely on XDMoD for long-term reporting and compliance.

What is the learning curve for XDMoD?

XDMoD has the steepest learning curve of the three due to its comprehensive feature set. Initial setup typically takes 2-3 days for a single-cluster deployment, including data ingestion and portal configuration. However, UB CCR provides extensive documentation, and the RPM-based deployment simplifies the process considerably. For smaller clusters with simpler needs, TACC Stats can be operational in under 2 hours.

Can I monitor GPU clusters with these tools?

Yes. XDMoD’s SUPReMM framework includes NVIDIA GPU performance counter collection (via NVML). TACC Stats has built-in GPU monitoring collectors. Variorum’s NVML backend provides the deepest GPU telemetry, including per-GPU power draw, frequency, and thermal data — essential for energy-aware scheduling on GPU-heavy clusters.

Are these tools suitable for cloud-based HPC clusters?

Yes, all three can run in cloud environments. XDMoD and TACC Stats can ingest job data from cloud batch services like AWS Batch and Azure CycleCloud via custom ingestion scripts. Variorum can also be useful in the cloud, where power telemetry is otherwise hard to obtain, provided the instance type exposes MSR or RAPL interfaces to the guest, which many virtualized instances do not.

