In a shared Kubernetes cluster, not all workloads are equally important. Production databases should never be evicted to make room for batch jobs. Critical API gateways deserve guaranteed resources over experimental workloads. Kubernetes PriorityClasses and preemption provide the mechanisms to enforce these priorities — but configuring them correctly requires understanding how the scheduler, eviction logic, and resource quotas interact.

This guide covers Kubernetes priority-based scheduling, preemption policies, and practical tools for managing workload priorities in production clusters.

Understanding Kubernetes Pod Priority

Kubernetes assigns each pod a priority value (integer) via a PriorityClass. When the scheduler cannot find sufficient resources for a pending pod, it may evict (preempt) lower-priority pods to make room.

Priority Class Definition

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: production-critical
value: 1000000
globalDefault: false
description: "Production-critical workloads that should never be preempted"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: production-standard
value: 500000
globalDefault: true
description: "Standard production workloads"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch-processing
value: 100000
globalDefault: false
description: "Batch and offline processing workloads"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: development
value: 1000
globalDefault: false
description: "Development and experimental workloads"
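
After applying these manifests, you can confirm the classes and their values were registered. The filename below is illustrative:

```shell
# Apply the PriorityClass manifests (filename is an assumption)
kubectl apply -f priority-classes.yaml

# List all priority classes with their values and default status
kubectl get priorityclass
```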

Applying Priority to Workloads

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-gateway
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api-gateway
  template:
    metadata:
      labels:
        app: api-gateway
    spec:
      priorityClassName: production-critical
      containers:
      - name: gateway
        image: envoyproxy/envoy:v1.29
        resources:
          requests:
            cpu: "500m"
            memory: "512Mi"
          limits:
            cpu: "2"
            memory: "2Gi"
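
Once the Deployment is admitted, the priority admission plugin resolves the class name to a numeric priority on each pod. A quick way to verify (the pod name is illustrative):

```shell
# Resolved numeric priority (populated at admission from the PriorityClass)
kubectl get pod api-gateway-<hash> -o jsonpath='{.spec.priority}'

# Effective preemption policy for the pod
kubectl get pod api-gateway-<hash> -o jsonpath='{.spec.preemptionPolicy}'
```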

Preemption: How Kubernetes Reclaims Resources

When a high-priority pod cannot be scheduled due to insufficient resources, the scheduler evaluates whether preempting lower-priority pods would free enough resources.

Preemption Algorithm

  1. Find nodes where the pending pod could fit if lower-priority pods were removed
  2. Select the node requiring the fewest preemptions (minimizes disruption)
  3. Evict the lowest-priority pods first
  4. Wait for evicted pods to terminate, then schedule the pending pod
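
When the scheduler decides to preempt, it records the chosen node on the pending pod before the victims finish terminating. You can observe this via the pod's status (pod name is illustrative):

```shell
# A preemptor pod waiting for victims to exit shows its target node here
kubectl get pod my-pending-pod -o jsonpath='{.status.nominatedNodeName}'
```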

Preemption Configuration

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch-no-preempt
value: 100000
preemptionPolicy: Never
description: "Batch workloads that should not preempt other pods (they can still be preempted by higher-priority pods)"

The preemptionPolicy field (alpha in Kubernetes 1.15, GA in 1.24) controls only whether pods using this class may preempt others:

  • Never — pods with this class never evict lower-priority pods; they wait for capacity to free up naturally
  • It does not shield these pods from preemption — whether a pod can be preempted depends solely on its own priority value relative to the pending higher-priority pod
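
A batch Job referencing this class therefore schedules only into spare capacity, since it can never evict anything. Image, command, and resource figures below are illustrative:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: nightly-report
spec:
  template:
    spec:
      priorityClassName: batch-no-preempt
      restartPolicy: Never
      containers:
      - name: report
        image: python:3.12-slim
        command: ["python", "-c", "print('report generated')"]
        resources:
          requests:
            cpu: "500m"
            memory: "256Mi"
```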

Priority-Based Scheduling Strategies

Strategy 1: Tiered Priority Model

# Infrastructure tier
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: system-critical
value: 2000000
globalDefault: false
---
# Production tier
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: prod-high
value: 1000000
globalDefault: false
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: prod-standard
value: 500000
globalDefault: true
---
# Non-production tier
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: staging
value: 100000
globalDefault: false
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: dev-test
value: 10000
globalDefault: false

Strategy 2: Resource Quota Integration

apiVersion: v1
kind: ResourceQuota
metadata:
  name: prod-quota
  namespace: production
spec:
  hard:
    requests.cpu: "100"
    requests.memory: "200Gi"
    limits.cpu: "200"
    limits.memory: "400Gi"
    pods: "500"
  scopeSelector:
    matchExpressions:
    - operator: In
      scopeName: PriorityClass
      values: ["production-critical", "production-standard"]
---
apiVersion: v1
kind: LimitRange
metadata:
  name: prod-limits
  namespace: production
spec:
  limits:
  - type: Container
    defaultRequest:
      cpu: "100m"
      memory: "128Mi"
    default:
      cpu: "1"
      memory: "1Gi"
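
Quotas can also be scoped so their limits apply only to pods of a particular priority tier — for example, capping how much the batch tier may consume in a namespace (the figures are illustrative):

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: batch-quota
  namespace: production
spec:
  hard:
    requests.cpu: "20"
    requests.memory: "40Gi"
  scopeSelector:
    matchExpressions:
    - operator: In
      scopeName: PriorityClass
      values: ["batch-processing"]
```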

Strategy 3: Namespace-Based Priority

# Kubernetes has no built-in per-namespace default PriorityClass — LimitRange
# cannot set one. A mutating admission policy can inject it instead; for
# example, with Kyverno (policy details below are a sketch):
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: default-priority
spec:
  rules:
  - name: batch-jobs-default-priority
    match:
      any:
      - resources:
          kinds: ["Pod"]
          namespaces: ["batch-jobs"]
    mutate:
      patchStrategicMerge:
        spec:
          # add only if the pod does not already set one
          +(priorityClassName): batch-processing

Kubernetes Priority Management Tools

kube-scheduler Configuration

The default scheduler handles priority and preemption. For advanced scenarios, you can customize the scheduler:

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: custom-scheduler
  pluginConfig:
  - name: DefaultPreemption
    args:
      minCandidateNodesPercentage: 10
      minCandidateNodesAbsolute: 100

Priority Class Validation Webhook

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: priority-validator
webhooks:
- name: priority.check.example.com
  admissionReviewVersions: ["v1"]
  sideEffects: None
  rules:
  - operations: ["CREATE", "UPDATE"]
    apiGroups: [""]
    apiVersions: ["v1"]
    resources: ["pods"]
  clientConfig:
    service:
      name: priority-webhook
      namespace: kube-system
      path: /validate

Deployment: Priority-Aware Helm Chart

# values.yaml
priority:
  enabled: true
  classes:
    - name: production-critical
      value: 1000000
      globalDefault: false
    - name: batch
      value: 100000
      preemptionPolicy: Never

# templates/priorityclass.yaml
{{- if .Values.priority.enabled }}
{{- range .Values.priority.classes }}
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: {{ .name }}
value: {{ .value }}
{{- if .globalDefault }}
globalDefault: {{ .globalDefault }}
{{- end }}
{{- if .preemptionPolicy }}
preemptionPolicy: {{ .preemptionPolicy }}
{{- end }}
---
{{- end }}
{{- end }}

Comparison: Priority Management Approaches

| Approach                  | Complexity | Flexibility | Best For                |
|---------------------------|------------|-------------|-------------------------|
| Tiered Priority Classes   | Low        | Medium      | Most clusters           |
| Resource Quota + Priority | Medium     | High        | Multi-tenant clusters   |
| Namespace-Based Priority  | Low        | Low         | Simple org structures   |
| Custom Scheduler          | High       | Very High   | Specialized workloads   |
| Webhook Validation        | Medium     | High        | Governance & compliance |

Why Self-Host Kubernetes with Priority Scheduling?

Priority-based scheduling is essential for any shared Kubernetes cluster where diverse workloads compete for limited resources. In self-hosted environments:

  • No cloud scheduler overrides: Cloud providers sometimes impose their own scheduling policies. Self-hosted Kubernetes gives you full control over priority and preemption behavior.
  • Cost optimization: Preempting batch workloads during peak production hours maximizes hardware utilization without over-provisioning.
  • Compliance requirements: Regulated industries need guaranteed resource allocation for audit logging, security monitoring, and data protection workloads.
  • Multi-tenant fairness: Priority classes ensure that critical tenant workloads always receive resources, while non-critical workloads yield during contention.

For Kubernetes resource management, see our resource quota guide and autoscaling comparison.

Best Practices for Priority Management

  1. Reserve high priority values for truly critical workloads — If everything is priority 1,000,000, nothing is
  2. Use preemptionPolicy: Never for batch workloads — Prevents cascading preemption storms
  3. Set globalDefault to a mid-range priority — Unspecified pods get reasonable priority without manual configuration
  4. Monitor preemption events — Use kubectl get events --field-selector reason=Preempted to track disruptions
  5. Test preemption scenarios — Simulate resource contention in staging to verify priority behavior
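
For practice 5, a low-priority "filler" deployment is a simple way to create contention in staging: scale it up until nodes are full, then deploy a higher-priority workload and watch which pods get preempted. Replica count and resource figures are illustrative:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: preemption-filler
  namespace: staging
spec:
  replicas: 20
  selector:
    matchLabels:
      app: preemption-filler
  template:
    metadata:
      labels:
        app: preemption-filler
    spec:
      priorityClassName: development
      containers:
      - name: pause
        # minimal placeholder container that just reserves resources
        image: registry.k8s.io/pause:3.9
        resources:
          requests:
            cpu: "500m"
            memory: "512Mi"
```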

Resource Contention Scenarios

Understanding how priority and preemption behave under resource pressure is critical for cluster operators. Here are common scenarios and expected behavior:

Scenario 1: Production database node failure — When a node hosting production databases fails, the scheduler attempts to reschedule those pods on the remaining nodes. With production-critical priority (1,000,000), the scheduler will preempt batch-processing pods (100,000) to make room. Note that preemptionPolicy: Never on the batch class does not protect the batch pods — it only stops them from preempting others. If instead the database pods' own class sets preemptionPolicy: Never, they cannot preempt anything and may remain Pending until new nodes are added or other pods terminate naturally.

Scenario 2: Deploy-time resource spike — Rolling out a new version of a high-priority deployment may temporarily require extra resources while old and new pods overlap. If insufficient resources exist, the scheduler may evict lower-priority pods to place the surge pods. Setting maxSurge: 0 (which requires maxUnavailable of at least 1) in the Deployment strategy avoids the spike by terminating old pods before creating new ones, at the cost of reduced capacity during the rollout.

Scenario 3: Node drain for maintenance — When draining a node with kubectl drain, pods are evicted in a way that respects PodDisruptionBudgets. Priority classes don't affect drain behavior — PDBs determine whether an eviction is allowed, and the --grace-period flag controls how long each pod gets to shut down, not the order of eviction.

Monitoring Preemption Events

# List recent preemption events
kubectl get events --sort-by='.lastTimestamp' --field-selector reason=Preempted

# Check which pods were preempted and why
kubectl describe pod <pod-name> | grep -A5 "Preempted"

# View preemption decisions in scheduler logs
kubectl logs -n kube-system deploy/kube-scheduler | grep -i preempt

PodDisruptionBudget Integration

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-gateway-pdb
  namespace: production
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: api-gateway

PDBs protect high-priority workloads from voluntary disruptions (drain, cluster-autoscaler scale-down). They don't guarantee protection against scheduler preemption: the scheduler prefers victims whose eviction would not violate a PDB, but this is best-effort, and a PDB may still be violated when no other candidates exist. For comprehensive protection, combine PriorityClasses with PDBs and resource quotas.

FAQ

What priority values should I use?

Kubernetes caps user-defined priority values at 1,000,000,000 (1 billion). The built-in classes system-cluster-critical (2,000,000,000) and system-node-critical (2,000,001,000) are reserved for essential system pods. For user-defined classes, a common scheme is: system add-ons (1M+), production-critical (500K-1M), production-standard (100K-500K), batch (10K-100K), development (1-10K).

Can a pod be preempted if it has the highest priority?

No. Preemption only evicts pods with strictly lower priority values. If a pending pod has the highest priority in the cluster and still cannot be scheduled, it will remain Pending until resources become available through other means (pod termination, node addition).

What is the difference between priority and QoS class?

PriorityClass determines scheduling order and preemption behavior. QoS class (Guaranteed, Burstable, BestEffort) is derived from resource requests and limits and governs eviction when a node is under resource pressure (memory pressure, disk pressure). The two are mostly independent, though the kubelet's node-pressure eviction does consider priority as a secondary factor: it first targets pods whose usage exceeds their requests (which includes every BestEffort pod), ranking those candidates by priority. A high-priority pod with BestEffort QoS can therefore still be evicted by the kubelet during node pressure.
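
To get both strong scheduling priority and the most favorable eviction ranking, pair a high PriorityClass with Guaranteed QoS by setting requests equal to limits. The image and figures below are illustrative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: audit-logger
spec:
  priorityClassName: production-critical
  containers:
  - name: logger
    image: fluent/fluent-bit:2.2
    resources:
      requests:
        cpu: "500m"
        memory: "512Mi"
      limits:       # equal to requests -> Guaranteed QoS class
        cpu: "500m"
        memory: "512Mi"
```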

How do I prevent preemption storms?

Set preemptionPolicy: Never on low-priority workload classes. This prevents them from preempting even-lower-priority pods, which can cause cascading eviction chains. Additionally, use PodDisruptionBudgets to protect critical workloads from voluntary disruptions.

Does priority work with Vertical Pod Autoscaler (VPA)?

Yes, but with caveats. VPA adjusts resource requests/limits but doesn’t change pod priority. If VPA increases a pod’s resource requests beyond available capacity, the pod may be evicted and recreated with new resource values — at which point priority and preemption rules apply normally.

Can I change a pod’s priority after it’s running?

No. A pod's priorityClassName and resolved priority are immutable once the pod is created; to change them, you must delete and recreate the pod. A PriorityClass's value field is likewise immutable — to change a tier's value, delete and recreate the class, which affects only pods admitted afterward; already-running pods keep the priority resolved at admission time.