In a shared Kubernetes cluster, not all workloads are equally important. Production databases should never be evicted to make room for batch jobs. Critical API gateways deserve guaranteed resources over experimental workloads. Kubernetes PriorityClasses and preemption provide the mechanisms to enforce these priorities — but configuring them correctly requires understanding how the scheduler, eviction logic, and resource quotas interact.
This guide covers Kubernetes priority-based scheduling, preemption policies, and practical tools for managing workload priorities in production clusters.
## Understanding Kubernetes Pod Priority
Kubernetes assigns each pod a priority value (integer) via a PriorityClass. When the scheduler cannot find sufficient resources for a pending pod, it may evict (preempt) lower-priority pods to make room.
### Priority Class Definition
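A minimal PriorityClass manifest looks like the following (the class name, value, and description are illustrative):

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: production-critical       # illustrative name
value: 1000000                    # higher value = higher priority
globalDefault: false              # do not apply to pods that omit priorityClassName
preemptionPolicy: PreemptLowerPriority  # default: may preempt lower-priority pods
description: "Reserved for production-critical services."
```

PriorityClass is a cluster-scoped resource, so it has no namespace; any pod in the cluster can reference it by name.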
### Applying Priority to Workloads
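Workloads opt in by setting `spec.priorityClassName` in the pod template. A sketch, with illustrative names, image, and resource figures:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-gateway
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api-gateway
  template:
    metadata:
      labels:
        app: api-gateway
    spec:
      priorityClassName: production-critical  # must reference an existing PriorityClass
      containers:
        - name: gateway
          image: registry.example.com/api-gateway:1.0  # illustrative image
          resources:
            requests:
              cpu: "500m"
              memory: 512Mi
```

The admission controller resolves the class name to its integer value at pod creation time and stores it in `spec.priority`.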
## Preemption: How Kubernetes Reclaims Resources
When a high-priority pod cannot be scheduled due to insufficient resources, the scheduler evaluates whether preempting lower-priority pods would free enough resources.
### Preemption Algorithm
1. Find nodes where the pending pod would fit if some lower-priority pods were removed
2. Select the node requiring the fewest preemptions, minimizing disruption
3. Evict the lowest-priority pods on that node first
4. Wait for the evicted pods to terminate, then bind the pending pod to the node
### Preemption Configuration
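A non-preempting class for batch work might be defined like this (name and value are illustrative):

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch-processing
value: 100000
# Never: pending pods of this class wait for free capacity instead of evicting others
preemptionPolicy: Never
description: "Batch jobs that should never displace running workloads."
```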
The preemptionPolicy field (stable since Kubernetes 1.24) controls whether pending pods of this class may trigger preemption:
- `PreemptLowerPriority` (the default): the scheduler may evict lower-priority pods to make room for a pending pod of this class
- `Never`: pending pods of this class stay in the scheduling queue until capacity frees up on its own; they never evict other pods

Note that preemptionPolicy does not protect a running pod from being preempted. Whether a pod can be chosen as a victim depends only on its priority value relative to the pending pod's.
## Priority-Based Scheduling Strategies
### Strategy 1: Tiered Priority Model
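One possible tier layout, with illustrative names and values spaced to leave room for future classes:

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: production-critical
value: 1000000
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: production-standard
value: 500000
globalDefault: true        # pods without an explicit class land here
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch
value: 100000
preemptionPolicy: Never    # batch jobs wait rather than evict
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: development
value: 10000
preemptionPolicy: Never
```

Only one class in the cluster may set `globalDefault: true`.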
### Strategy 2: Resource Quota Integration
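A ResourceQuota can be scoped to a PriorityClass, capping how much capacity high-priority pods may claim in a namespace. A sketch (namespace, limits, and class name are illustrative):

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: critical-quota
  namespace: team-a            # illustrative namespace
spec:
  hard:
    cpu: "20"                  # total CPU requests for matching pods
    memory: 64Gi
  scopeSelector:
    matchExpressions:
      - operator: In
        scopeName: PriorityClass
        values: ["production-critical"]  # quota applies only to pods in this class
```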
### Strategy 3: Namespace-Based Priority
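One native way to tie priority to namespaces is a zero quota scoped to the classes a namespace must not use, so pods requesting those classes are rejected at admission. A sketch (namespace and class names are illustrative):

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: deny-high-priority
  namespace: dev-sandbox       # illustrative namespace
spec:
  hard:
    pods: "0"                  # zero budget: no pod may use the scoped classes here
  scopeSelector:
    matchExpressions:
      - operator: In
        scopeName: PriorityClass
        values: ["production-critical", "production-standard"]
```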
## Kubernetes Priority Management Tools
### kube-scheduler Configuration
The default scheduler handles priority and preemption. For advanced scenarios, you can customize the scheduler:
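For example, a scheduler profile with the DefaultPreemption plugin disabled never preempts; pods select it via `spec.schedulerName`. A sketch (profile names are illustrative):

```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler      # normal priority + preemption behavior
  - schedulerName: no-preemption-scheduler
    plugins:
      postFilter:
        disabled:
          - name: DefaultPreemption       # this profile never evicts other pods
```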
### Priority Class Validation Webhook
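The webhook registration might look like the following; the webhook server itself (service, TLS certificate, and the validation logic that decides which namespaces may use which classes) must be deployed separately, and all names here are illustrative:

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: priorityclass-policy
webhooks:
  - name: priority.policy.example.com     # illustrative webhook name
    admissionReviewVersions: ["v1"]
    sideEffects: None
    failurePolicy: Fail                   # reject pods if the webhook is unreachable
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE"]
        resources: ["pods"]
    clientConfig:
      service:
        name: priority-webhook            # illustrative service running the validator
        namespace: platform
        path: /validate
      # caBundle: <base64 CA certificate for the webhook's serving cert>
```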
### Deployment: Priority-Aware Helm Chart
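A chart can expose the class as a value and inject it into the pod spec only when one is set. A sketch (value and file names are illustrative):

```yaml
# values.yaml (illustrative default; set to "" to omit the field)
priorityClassName: production-standard
```

```yaml
# templates/deployment.yaml (fragment)
spec:
  template:
    spec:
      {{- with .Values.priorityClassName }}
      priorityClassName: {{ . }}
      {{- end }}
```

Guarding with `with` keeps the rendered manifest valid when no class is configured, letting the cluster's `globalDefault` class apply.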
## Comparison: Priority Management Approaches
| Approach | Complexity | Flexibility | Best For |
|---|---|---|---|
| Tiered Priority Classes | Low | Medium | Most clusters |
| Resource Quota + Priority | Medium | High | Multi-tenant clusters |
| Namespace-Based Priority | Low | Low | Simple org structures |
| Custom Scheduler | High | Very High | Specialized workloads |
| Webhook Validation | Medium | High | Governance & compliance |
## Why Self-Host Kubernetes with Priority Scheduling?
Priority-based scheduling is essential for any shared Kubernetes cluster where diverse workloads compete for limited resources. In self-hosted environments:
- No cloud scheduler overrides: Cloud providers sometimes impose their own scheduling policies. Self-hosted Kubernetes gives you full control over priority and preemption behavior.
- Cost optimization: Preempting batch workloads during peak production hours maximizes hardware utilization without over-provisioning.
- Compliance requirements: Regulated industries need guaranteed resource allocation for audit logging, security monitoring, and data protection workloads.
- Multi-tenant fairness: Priority classes ensure that critical tenant workloads always receive resources, while non-critical workloads yield during contention.
For Kubernetes resource management, see our resource quota guide and autoscaling comparison.
## Best Practices for Priority Management
- Reserve high priority values for truly critical workloads: if everything is priority 1,000,000, nothing is
- Use `preemptionPolicy: Never` for batch workloads: prevents cascading preemption storms
- Mark a mid-range class as `globalDefault: true`: pods that omit `priorityClassName` then get a reasonable priority without manual configuration
- Monitor preemption events: use `kubectl get events --field-selector reason=Preempted` to track disruptions
- Test preemption scenarios: simulate resource contention in staging to verify priority behavior
## Resource Contention Scenarios
Understanding how priority and preemption behave under resource pressure is critical for cluster operators. Here are common scenarios and expected behavior:
Scenario 1: Production database node failure. When a node hosting production databases fails, the scheduler attempts to reschedule those pods on the remaining nodes. With production-critical priority (1,000,000), the scheduler will preempt batch-processing pods (100,000) to make room. Note that `preemptionPolicy: Never` on the batch class does not shield batch pods from eviction; the policy only governs whether a pending pod may preempt others. If the database class itself used `preemptionPolicy: Never`, the database pods would instead remain Pending until new nodes are added or existing pods terminate naturally.
Scenario 2: Deploy-time resource spike. Rolling out a new version of a high-priority deployment may temporarily require up to 2x resources (old plus new pods). If insufficient resources exist, the scheduler preempts the lowest-priority pods. Setting `maxSurge: 0` with a nonzero `maxUnavailable` in your Deployment strategy avoids the spike by terminating old pods before creating new ones.
Scenario 3: Node drain for maintenance. When draining a node with `kubectl drain`, pods are evicted in accordance with their PodDisruptionBudgets. Priority classes don't affect drain behavior; PDBs govern which evictions are allowed, and the `--grace-period` flag controls how long each pod gets to shut down.
## Monitoring Preemption Events
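Assuming `kubectl` access to the cluster, preemption can be tracked through events and kube-scheduler metrics (exact event reasons and metric names can vary by Kubernetes version):

```shell
# Cluster-wide events recorded when pods are preempted
kubectl get events -A --field-selector reason=Preempted --sort-by=.lastTimestamp

# If the scheduler's metrics endpoint is scraped (e.g. by Prometheus), watch:
#   scheduler_preemption_attempts_total  - total preemption attempts
#   scheduler_preemption_victims         - histogram of victims per attempt
```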
## PodDisruptionBudget Integration
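A PDB that keeps a minimum number of replicas running through voluntary disruptions might look like this (name, selector, and threshold are illustrative):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-gateway-pdb
spec:
  minAvailable: 2            # keep at least 2 matching pods through voluntary evictions
  selector:
    matchLabels:
      app: api-gateway       # must match the workload's pod labels
```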
PDBs protect high-priority workloads from voluntary disruptions (drain, cluster-autoscaler scale-down). Scheduler preemption is different: the scheduler prefers victims whose eviction would not violate a PDB, but this is best-effort, and it may still violate a PDB if preemption is otherwise impossible. For comprehensive protection, combine PriorityClasses with PDBs and resource quotas.
## FAQ
### What priority values should I use?
User-defined PriorityClass values can range up to 1,000,000,000 (1 billion). The built-in system classes (`system-cluster-critical`, `system-node-critical`) use values of 2,000,000,000 and above and are reserved for critical cluster components. For user-defined classes, a common scheme is: system (1M+), production-critical (500K-1M), production-standard (100K-500K), batch (10K-100K), development (1-10K).
### Can a pod be preempted if it has the highest priority?
No. Preemption only evicts pods with strictly lower priority values. If a pending pod has the highest priority in the cluster and still cannot be scheduled, it will remain Pending until resources become available through other means (pod termination, node addition).
### What is the difference between priority and QoS class?
PriorityClass determines scheduling order and preemption behavior. QoS class (Guaranteed, Burstable, BestEffort) determines eviction order when a node is under resource pressure (memory pressure, disk pressure). They operate independently — a high-priority pod with BestEffort QoS can still be evicted by the kubelet during node pressure.
### How do I prevent preemption storms?
Set preemptionPolicy: Never on low-priority workload classes. This prevents them from preempting even-lower-priority pods, which can cause cascading eviction chains. Additionally, use PodDisruptionBudgets to protect critical workloads from voluntary disruptions.
### Does priority work with Vertical Pod Autoscaler (VPA)?
Yes, but with caveats. VPA adjusts resource requests/limits but doesn’t change pod priority. If VPA increases a pod’s resource requests beyond available capacity, the pod may be evicted and recreated with new resource values — at which point priority and preemption rules apply normally.
### Can I change a pod's priority after it's running?
No. `priorityClassName` is immutable on an existing pod; to change a pod's priority, you must delete and recreate it. A PriorityClass's `value` field is immutable as well: changing it also requires deleting and recreating the class. Pods already admitted keep the priority value resolved at creation time; the new value applies only to pods created afterward.