High Performance Computing (HPC) clusters require sophisticated workload managers to schedule jobs, allocate resources, and maximize throughput across hundreds or thousands of compute nodes. Whether you run a research cluster, an ML training farm, or a rendering pipeline, choosing the right workload manager is critical.

This guide compares three leading open-source HPC workload managers: Slurm, OpenPBS, and HTCondor — helping you pick the right tool for your cluster.

Quick Comparison Table

| Feature | Slurm | OpenPBS | HTCondor |
| --- | --- | --- | --- |
| GitHub Stars | ~3,946 | ~792 | ~314 |
| Primary Focus | HPC job scheduling | Batch job management | High-throughput computing |
| Scheduling Model | Priority-based, backfill | FIFO with priorities | Matchmaking, opportunistic |
| Resource Types | CPU, GPU, memory, nodes | CPU, memory, nodes | CPU, memory, storage |
| Multi-Cluster | Yes (federation) | Via job routing | Yes (flocking) |
| Web UI | Slurm-web, Open OnDemand | PBS Professional Web UI | Open OnDemand, condor_web |
| License | GPLv2 | PostgreSQL License | Apache 2.0 |
| Container Support | Yes (container plugin) | Yes (Docker/Singularity) | Yes (Docker/Singularity) |
| GPU Scheduling | Native (gres) | Via resources | Via machine ads |
| Active Development | Very active | Active | Active |

Slurm — The Industry Standard

Slurm (Simple Linux Utility for Resource Management) is the most widely deployed HPC workload manager in the world. It powers over 60% of the TOP500 supercomputers, including the world’s fastest systems.

Key Features

  • Fast scheduling: Sub-second scheduling decisions for large clusters
  • Backfill scheduling: Maximizes utilization by fitting smaller jobs into gaps
  • Resource limits: Fine-grained control over CPU, memory, GPU, and node allocation
  • Job arrays: Submit thousands of similar jobs with a single command
  • Federation: Connect multiple Slurm clusters for cross-cluster job submission
  • GPU support: Generic resource (gres) plugin for native GPU scheduling
  • Container integration: Supports Singularity, Charliecloud, and Shifter
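The job-array feature above can be sketched with a minimal batch script; a single submission fans out into many indexed tasks. Script and data names here are illustrative, not from the source:

```shell
#!/bin/bash
#SBATCH --job-name=array-demo
#SBATCH --array=1-100          # 100 tasks, indices 1..100
#SBATCH --cpus-per-task=1
#SBATCH --mem=2G
#SBATCH --time=00:10:00

# Each task receives its own index in SLURM_ARRAY_TASK_ID
./process_chunk --input "data/chunk_${SLURM_ARRAY_TASK_ID}.dat"
```

Submitting this once with `sbatch array.sh` enqueues all 100 tasks; Slurm schedules them independently as resources free up.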

Installation

Slurm is typically installed via package manager on the controller and compute nodes:

# On Ubuntu/Debian (controller)
sudo apt update
sudo apt install -y slurmctld slurmdbd munge
sudo systemctl enable --now munge slurmctld slurmdbd

# On compute nodes
sudo apt install -y slurmd munge
sudo systemctl enable --now munge slurmd
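Once the daemons are running (and the munge key has been copied to every node), a quick sanity check confirms the cluster is up:

```shell
sinfo               # partitions and node states
srun -N1 hostname   # run a trivial job on one node
squeue              # list queued and running jobs
```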

Docker Deployment

While Slurm is traditionally deployed on bare metal, you can run it in Docker for development and testing:

version: "3.8"
services:
  slurmctld:
    # Illustrative image name; substitute the controller image you build or pull
    image: slurm/slurmctld:latest
    container_name: slurmctld
    hostname: slurmctld
    environment:
      - SLURM_CLUSTER_NAME=mycluster
      - SLURM_NODELIST=compute[1-2]
      - SLURM_PARTITION=compute  # partition name; nodes come from SLURM_NODELIST
    volumes:
      - slurm-state:/var/lib/slurmctld
      - /var/run/munge/munge.socket.2:/var/run/munge/munge.socket.2
    networks:
      slurm-net:
        ipv4_address: 10.10.0.2

  compute1:
    # Illustrative image name; substitute the compute-node image you build or pull
    image: slurm/slurmd:latest
    container_name: compute1
    hostname: compute1
    environment:
      - SLURM_CLUSTER_NAME=mycluster
      - SLURMD_NODENAME=compute1
    depends_on:
      - slurmctld
    networks:
      slurm-net:
        ipv4_address: 10.10.0.3

  compute2:
    # Illustrative image name; substitute the compute-node image you build or pull
    image: slurm/slurmd:latest
    container_name: compute2
    hostname: compute2
    environment:
      - SLURM_CLUSTER_NAME=mycluster
      - SLURMD_NODENAME=compute2
    depends_on:
      - slurmctld
    networks:
      slurm-net:
        ipv4_address: 10.10.0.4

volumes:
  slurm-state:

networks:
  slurm-net:
    driver: bridge
    ipam:
      config:
        - subnet: 10.10.0.0/24

Configuration Example

A minimal slurm.conf for a small cluster:

ClusterName=mycluster
SlurmctldHost=slurmctld
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge
StateSaveLocation=/var/lib/slurmctld
SlurmdSpoolDir=/var/lib/slurmd

# Nodes
NodeName=compute[1-2] CPUs=8 RealMemory=16000 State=UNKNOWN

# Partition
PartitionName=compute Nodes=compute[1-2] Default=YES MaxTime=INFINITE State=UP
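With this configuration in place, a user-side batch script requests resources from the compute partition. The script body and binary name are illustrative:

```shell
#!/bin/bash
#SBATCH --partition=compute
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --mem=8G
#SBATCH --time=01:00:00

# srun launches the tasks inside the allocation granted by sbatch
srun ./my_solver input.cfg
```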

Reverse Proxy Setup (Nginx)

For web UI access via Slurm-web or Open OnDemand:

server {
    listen 80;
    server_name slurm.example.com;

    location / {
        proxy_pass http://127.0.0.1:8080;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
}

OpenPBS — The Flexible Alternative

OpenPBS is an open-source batch queuing system originally developed by NASA. It manages job queues, allocates resources, and provides fair-share scheduling for compute clusters.

Key Features

  • Job arrays: Submit parameterized job collections
  • Routing queues: Automatically route jobs to appropriate execution queues
  • Fair-share scheduling: Balance resource allocation across users and groups
  • Mom hooks: Custom Python hooks for job lifecycle events
  • Resource limits: Control CPU, memory, walltime, and custom resources
  • Job dependencies: Chain jobs with before/after dependencies
  • Checkpoint/restart: Save and resume long-running jobs
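As a sketch of the resource-limit and dependency features above, a PBS batch script declares its requirements in `#PBS` directives; the script and program names are illustrative:

```shell
#!/bin/bash
#PBS -N preprocess
#PBS -l select=1:ncpus=4:mem=8gb
#PBS -l walltime=02:00:00
#PBS -q batch

# PBS starts jobs in $HOME; change to the submission directory first
cd "$PBS_O_WORKDIR"
./preprocess input.dat
```

Chaining is done at submit time: `JOB1=$(qsub preprocess.sh)` followed by `qsub -W depend=afterok:$JOB1 analyze.sh` runs the second job only if the first exits successfully.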

Installation

# On Ubuntu/Debian (OpenPBS is not in the stock repositories; download
# the .deb packages from the OpenPBS GitHub releases first)
sudo apt update
sudo apt install -y ./openpbs-server*.deb ./openpbs-execution*.deb
sudo systemctl enable --now pbs

# Initialize PBS: create a default execution queue and enable scheduling
. /etc/pbs.conf
sudo /opt/pbs/bin/qmgr -c "create queue batch queue_type=execution"
sudo /opt/pbs/bin/qmgr -c "set queue batch enabled=true"
sudo /opt/pbs/bin/qmgr -c "set queue batch started=true"
sudo /opt/pbs/bin/qmgr -c "set server default_queue=batch"
sudo /opt/pbs/bin/qmgr -c "set server scheduling=true"
sudo /opt/pbs/bin/qmgr -c "set queue batch resources_default.walltime=24:00:00"
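After initialization, two commands confirm the server and execution nodes are healthy:

```shell
qstat -B        # server status and job counts
pbsnodes -a     # state of every execution (MoM) node
```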

Docker Deployment

version: "3.8"
services:
  pbs-server:
    # Illustrative image name; substitute an image that ships OpenPBS
    image: openpbs/openpbs:latest
    container_name: pbs-server
    hostname: pbs-server
    environment:
      - PUID=1000
      - PGID=1000
      - TZ=Etc/UTC
    ports:
      - "15001:15001"  # pbs_server
      - "15002:15002"  # pbs_mom
    volumes:
      # A named volume must mount a directory, not the /etc/pbs.conf file
      - pbs-data:/var/spool/pbs
    networks:
      pbs-net:
        ipv4_address: 10.20.0.2

volumes:
  pbs-data:

networks:
  pbs-net:
    driver: bridge

HTCondor — High-Throughput Computing

HTCondor (formerly Condor) specializes in high-throughput computing — maximizing the total number of jobs completed over time, especially in environments with heterogeneous resources and opportunistic scheduling.

Key Features

  • Matchmaking: ClassAds system matches job requirements with machine capabilities
  • Opportunistic computing: Utilize idle desktop/workstation cycles
  • Flocking: Connect multiple Condor pools for resource sharing
  • Job checkpointing: Automatic checkpoint and migration of jobs
  • File transfer: Automatic input/output file staging
  • DAGMan: Workflow manager for complex job dependencies
  • High availability: Central manager failover support
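The ClassAd matchmaking and file-transfer features above come together in the submit description file, HTCondor's unit of job definition. A minimal sketch (all file names illustrative), submitted with `condor_submit analyze.sub`:

```
# analyze.sub -- HTCondor submit description (illustrative)
executable              = analyze.sh
arguments               = $(Process)

# Matchmaking: only machines whose ClassAds satisfy this expression qualify
requirements            = (OpSys == "LINUX") && (Memory >= 4000)
request_cpus            = 1
request_memory          = 4GB

# Automatic file staging to and from the execute machine
transfer_input_files    = data/input_$(Process).dat
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT

output                  = out/analyze.$(Process).out
error                   = err/analyze.$(Process).err
log                     = analyze.log
queue 100
```

The final `queue 100` enqueues 100 jobs, with `$(Process)` taking values 0 through 99.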

Installation

# On Ubuntu/Debian
sudo apt update
sudo apt install -y htcondor

# Configure this host as the central manager via a config.d snippet
echo 'DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD' | \
  sudo tee /etc/condor/config.d/01-central-manager.conf

sudo systemctl enable --now condor
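With the daemons up, these commands verify the pool:

```shell
condor_status   # machines advertised to the collector
condor_q        # jobs in the local schedd's queue
```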

Docker Deployment

version: "3.8"
services:
  condor-central:
    # htcondor/execute runs a worker; the central manager needs the cm image
    image: htcondor/cm:latest
    container_name: condor-central
    hostname: condor-central
    environment:
      - CONDOR_HOST=condor-central
    volumes:
      - condor-config:/etc/condor
      - condor-spool:/var/lib/condor/spool
    networks:
      condor-net:
        ipv4_address: 10.30.0.2

  condor-execute:
    image: htcondor/execute:latest
    container_name: condor-execute
    hostname: condor-execute
    environment:
      - CONDOR_HOST=condor-central
    depends_on:
      - condor-central
    networks:
      condor-net:
        ipv4_address: 10.30.0.3

volumes:
  condor-config:
  condor-spool:

networks:
  condor-net:
    driver: bridge

Choosing the Right Workload Manager

Use Slurm When:

  • You need the industry standard with the largest community
  • You manage a homogeneous HPC cluster with dedicated nodes
  • You need advanced scheduling features (backfill, preemption, QoS)
  • You require GPU and specialized resource scheduling

Use OpenPBS When:

  • You need flexible queue routing and fair-share policies
  • You want a mature, well-documented system with Python hooks
  • Your workload is primarily batch processing with predictable runtimes
  • You need strong integration with existing PBS workflows

Use HTCondor When:

  • You have heterogeneous resources (desktops, workstations, servers)
  • You want to opportunistically use idle compute cycles
  • You need sophisticated job matchmaking with ClassAds
  • Your focus is on throughput (total jobs completed) rather than latency

Why Self-Host Your HPC Infrastructure?

Running your own workload manager gives you complete control over job scheduling policies, resource allocation, and cluster configuration. Self-hosted HPC tools eliminate vendor lock-in and per-core licensing fees that commercial schedulers charge.

For container orchestration on smaller clusters, see our Kubernetes vs Docker Swarm vs Nomad comparison. If you’re managing Kubernetes clusters and need job scheduling within them, check our workflow orchestration guide. For server deployment and management automation, our Ansible UI comparison covers complementary tools.

FAQ

What is a workload manager?

A workload manager (also called a job scheduler or batch system) is software that manages the execution of computational jobs on a cluster. It receives job submissions, queues them, allocates resources (CPU, memory, nodes), schedules execution, and tracks job completion. Without a workload manager, users would need to manually coordinate which jobs run on which nodes and when.

Is Slurm free and open source?

Yes, Slurm is released under the GPLv2 license and is free to use. The core scheduler, resource manager, and all standard plugins are open source. Some advanced features and commercial support are available through SchedMD, but the open-source version is production-ready and powers the majority of the world’s fastest supercomputers.

Can I run HPC workloads on Docker containers?

Yes, all three workload managers support containerized jobs. Slurm has native container plugins for Singularity and Charliecloud. OpenPBS can launch Docker and Singularity containers through job hooks. HTCondor supports Docker universe jobs that automatically handle container lifecycle. For development and testing, you can also run the workload managers themselves in Docker (see configurations above).

How do I monitor my HPC cluster?

You can use Slurm-web for Slurm, the qstat and pbsnodes commands for OpenPBS, or condor_status and condor_q for HTCondor (showq belongs to the Maui/Moab schedulers, not PBS). For broader infrastructure monitoring that covers the underlying nodes, tools like Netdata, Prometheus, and Zabbix can track node health, temperature, and resource utilization. See our GPU monitoring guide for hardware-level monitoring.

What’s the difference between HPC and HTC?

HPC (High Performance Computing) focuses on maximizing the speed of individual jobs — running large simulations or calculations as fast as possible using many cores. HTC (High Throughput Computing) focuses on maximizing the total number of jobs completed over time, even if individual jobs run slower. Slurm and OpenPBS are primarily HPC-oriented, while HTCondor specializes in HTC.

Can these tools manage GPU clusters?

Yes. Slurm has native GPU support through the Generic RESource (GRES) plugin, which tracks GPU availability and assigns GPUs to jobs. OpenPBS can manage GPUs as custom resources defined in the server configuration. HTCondor uses the machine’s ClassAd to advertise available GPUs and match them to jobs that require them.
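In practice, a two-GPU request looks roughly like this in each system; exact resource names vary with site configuration, and the script names are illustrative:

```shell
# Slurm: request 2 GPUs via GRES
sbatch --gres=gpu:2 train.sh

# OpenPBS: request GPUs as a chunk-level resource
qsub -l select=1:ncpus=8:ngpus=2 train.sh

# HTCondor: put "request_gpus = 2" in the submit file, then
condor_submit train.sub
```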