Infrastructure capacity planning is the art and science of ensuring you have enough compute, storage, and network resources to handle current and future workloads without overspending on idle hardware. Under-provisioning leads to performance degradation, service outages, and frustrated users. Over-provisioning wastes budget on servers that sit at 10 percent utilization. Self-hosted capacity planning tools help you strike the right balance by analyzing historical usage, simulating future scenarios, and providing actionable recommendations.
This guide compares three approaches to self-hosted capacity planning: Open Simulator (a Kubernetes cluster simulator from Alibaba), Cloud Custodian (a cloud resource management tool with capacity estimation capabilities), and K8s Resource Forecasting (Prometheus metrics combined with custom forecasting scripts). Each serves a different use case: Kubernetes cluster sizing, cloud cost optimization, and workload-based capacity prediction, respectively.
The Capacity Planning Problem
Capacity planning answers three fundamental questions:
- Current state – How much of our infrastructure is being used right now? Which resources are bottlenecks?
- Growth trajectory – Based on historical trends, when will we run out of capacity?
- What-if scenarios – If we add 50 percent more users, deploy a new service, or migrate to a different cluster topology, how will our infrastructure handle it?
Without systematic capacity planning, organizations typically discover resource shortages only when services start failing. Proactive capacity planning shifts this from reactive firefighting to planned infrastructure growth.
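The what-if question in particular can be made concrete with simple arithmetic. The following is a minimal sketch, assuming demand scales linearly with user count (often only approximately true) and a 70 percent target utilization; the function names and numbers are hypothetical.

```python
def projected_utilization(current_util: float, growth: float) -> float:
    """Utilization after demand grows by `growth` (0.5 means +50 percent)."""
    return current_util * (1.0 + growth)


def extra_capacity_needed(current_util: float, growth: float,
                          target_util: float = 0.7) -> float:
    """Fraction of additional capacity needed to stay at or below target_util.

    Returns 0.0 when the existing capacity already suffices.
    """
    projected = projected_utilization(current_util, growth)
    return max(0.0, projected / target_util - 1.0)


if __name__ == "__main__":
    # A cluster at 60 percent CPU that gains 50 percent more users
    print(projected_utilization(0.60, 0.50))
    print(extra_capacity_needed(0.60, 0.50))
```

In this example the cluster would land at about 90 percent utilization and would need roughly 29 percent more capacity to return to the 70 percent target.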
Open Simulator: Kubernetes Cluster Simulation
Open Simulator, developed by Alibaba, is an open-source Kubernetes cluster simulator designed for capacity planning. It models the scheduling behavior of a K8s cluster and simulates how workloads would be distributed across nodes under various configurations.
How It Works
Open Simulator reads your current cluster state (node resources, pod requirements, scheduling constraints) and creates a virtual model. You can then modify the model by adding or removing nodes, changing pod resource requests, or adjusting scheduling policies, and simulate how the cluster would behave under the new configuration. The output includes scheduling feasibility reports, resource utilization projections, and bottleneck identification.
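To see why a scheduling feasibility report can differ from a simple sum of resources, consider a toy placement model. This sketch is not Open Simulator's algorithm, and it is far simpler than the real Kubernetes scheduler, which filters and scores nodes: it uses plain first-fit placement on CPU and memory requests, and every name in it is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    cpu: float      # free CPU cores
    mem: float      # free memory in GiB
    pods: list = field(default_factory=list)


def simulate(nodes, pods):
    """Place pods first-fit, largest CPU request first.

    `pods` is a list of (name, cpu_request, mem_request) tuples.
    Returns the names of pods that could not be placed.
    """
    unschedulable = []
    for name, cpu, mem in sorted(pods, key=lambda p: p[1], reverse=True):
        for node in nodes:
            if node.cpu >= cpu and node.mem >= mem:
                node.cpu -= cpu
                node.mem -= mem
                node.pods.append(name)
                break
        else:
            # No node had enough free CPU and memory for this pod
            unschedulable.append(name)
    return unschedulable


if __name__ == "__main__":
    pods = [("api", 3, 4), ("db", 3, 8), ("worker", 3, 4), ("cache", 3, 4)]
    # 12 cores requested against 12 cores total, spread over three 4-core nodes
    nodes = [Node(f"node-{i}", cpu=4, mem=16) for i in range(3)]
    print(simulate(nodes, pods))
```

The totals match exactly (12 cores requested, 12 available), yet the fourth pod cannot be placed because each node has only one free core left. Fragmentation of this kind is what a scheduler-aware simulation surfaces and a resource-total calculation misses.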
Docker Compose Configuration
Simulation Configuration
Running Simulations
Key Features
- Scheduling simulation – Models K8s scheduler behavior including affinity/anti-affinity, topology spread constraints, and priority-based preemption
- Node failure modeling – Simulates node outages to verify cluster resilience
- Cost estimation – Estimates the cost impact of scaling decisions
- Visual reports – Generates HTML reports with resource utilization charts
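Node failure modeling, for example, boils down to removing a node and checking whether its pods still fit elsewhere. The following is a rough sketch of that idea and not Open Simulator's implementation: it considers CPU requests only, ignores affinity rules and preemption, and uses hypothetical names and data layouts throughout.

```python
def survives_node_loss(capacity, placements, failed):
    """Check whether pods on `failed` fit on the remaining nodes (first-fit).

    capacity:   {node_name: cpu_cores}
    placements: {node_name: [cpu_request, ...]} for pods currently on each node
    """
    # Free CPU on every surviving node
    free = {
        node: capacity[node] - sum(placements.get(node, []))
        for node in capacity if node != failed
    }
    # Re-place the displaced pods, largest first
    for request in sorted(placements.get(failed, []), reverse=True):
        target = next((n for n, cpu in free.items() if cpu >= request), None)
        if target is None:
            return False
        free[target] -= request
    return True


def n_minus_one_report(capacity, placements):
    """Map each node to True if the cluster can absorb its failure."""
    return {node: survives_node_loss(capacity, placements, node)
            for node in capacity}


if __name__ == "__main__":
    capacity = {"node-a": 8, "node-b": 8, "node-c": 8}
    placements = {"node-a": [4, 2], "node-b": [3, 3], "node-c": [2]}
    print(n_minus_one_report(capacity, placements))
```

A node that maps to `False` marks a failure the cluster could not absorb without adding capacity or evicting workloads.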
Pros and Cons
Pros:
- Specifically designed for Kubernetes capacity planning
- Models actual K8s scheduling behavior, not just resource totals
- Supports what-if scenarios for cluster growth and failure modes
- Open source under Apache 2.0
Cons:
- Development has slowed (last significant commit in 2023)
- Limited documentation beyond basic examples
- Built primarily with Alibaba Cloud in mind – some features require adaptation for other environments
- No built-in Prometheus integration for historical data
Cloud Custodian: Policy-Driven Resource Management
Cloud Custodian (originally developed at Capital One and now a CNCF project) is an open-source cloud resource management tool that enforces policies, optimizes costs, and provides capacity insights across multiple cloud providers. While not a dedicated capacity planning tool, its resource analysis and reporting capabilities make it valuable for capacity forecasting.
How It Works
Cloud Custodian queries cloud provider APIs to inventory resources, evaluates them against policy rules, and generates reports. Its capacity planning capabilities come from resource utilization analysis, idle resource detection, and right-sizing recommendations.
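The logic behind idle detection and right-sizing can be illustrated in plain Python. This sketch does not use Cloud Custodian's API or policy language; the thresholds, the inventory layout, and the function names are all assumptions made for illustration.

```python
from statistics import mean


def classify(cpu_samples, idle_below=5.0, oversized_below=30.0):
    """Label a resource from its CPU utilization samples (percent).

    Thresholds are illustrative defaults, not values taken from any tool.
    """
    if not cpu_samples:
        return "no-data"
    avg, peak = mean(cpu_samples), max(cpu_samples)
    if peak < idle_below:
        return "idle"
    if avg < oversized_below and peak < 2 * oversized_below:
        return "downsize-candidate"
    return "ok"


def capacity_report(inventory):
    """Group resource ids by classification.

    inventory: {resource_id: [cpu_percent, ...]}
    """
    report = {}
    for resource_id, samples in inventory.items():
        report.setdefault(classify(samples), []).append(resource_id)
    return report


if __name__ == "__main__":
    inventory = {
        "vm-web-1": [42.0, 55.0, 61.0],
        "vm-batch-7": [1.2, 0.8, 2.1],
        "vm-cache-3": [12.0, 18.0, 25.0],
    }
    print(capacity_report(inventory))
```

Checking the peak as well as the average keeps bursty workloads from being flagged for downsizing on the strength of a low mean alone.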
Docker Compose Configuration
Capacity Planning Policy (capacity.yaml)
Generating Capacity Reports
Pros and Cons
Pros:
- Multi-cloud support (AWS, GCP, Azure, Kubernetes)
- 100+ built-in resource filters and actions
- Integrates with Slack, SNS, SQS, and other notification services
- Active development with large community (11,000+ GitHub stars)
- Can automate capacity optimization (right-size, terminate idle resources)
Cons:
- Not a dedicated capacity planning tool – optimization is a side benefit
- Requires cloud provider API access (not suitable for bare-metal-only environments)
- Policy authoring has a learning curve
- Historical analysis depends on cloud provider metrics retention periods
K8s Resource Forecasting with Prometheus
For Kubernetes clusters already running Prometheus, custom resource forecasting grounds capacity planning in actual historical metrics rather than simulated models, which generally makes it the most accurate way to forecast resource growth.
Architecture
The flow is straightforward: Prometheus collects metrics from node_exporter and kube-state-metrics, a custom Python script queries the Prometheus API for historical trends, and the script generates a forecast report with capacity exhaustion predictions.
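The query step of this flow can be sketched with the standard library alone. The example below targets the Prometheus HTTP API's `/api/v1/query_range` endpoint and uses standard node_exporter metric names; the PromQL expressions, the helper names, and the one-hour step are assumptions to adapt to your own setup.

```python
import json
import urllib.parse
import urllib.request

# PromQL over standard node_exporter metrics; adjust label filters as needed
QUERIES = {
    "cpu": '1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m]))',
    "memory": "1 - sum(node_memory_MemAvailable_bytes)"
              " / sum(node_memory_MemTotal_bytes)",
}


def build_range_url(base_url, query, start, end, step="1h"):
    """Build a /api/v1/query_range URL (start and end are Unix timestamps)."""
    params = urllib.parse.urlencode(
        {"query": query, "start": start, "end": end, "step": step}
    )
    return f"{base_url.rstrip('/')}/api/v1/query_range?{params}"


def parse_matrix(payload):
    """Extract (timestamp, value) pairs from the first series of a matrix result."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    result = payload["data"]["result"]
    if not result:
        return []
    # Prometheus encodes sample values as strings
    return [(float(ts), float(value)) for ts, value in result[0]["values"]]


def fetch_series(base_url, query, start, end, step="1h"):
    """Query Prometheus and return the parsed samples."""
    url = build_range_url(base_url, query, start, end, step)
    with urllib.request.urlopen(url, timeout=30) as response:
        return parse_matrix(json.load(response))
```

A call such as `fetch_series("http://localhost:9090", QUERIES["cpu"], start, end)` would return cluster-wide CPU utilization as a list of `(timestamp, ratio)` pairs, assuming Prometheus listens on that address.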
Prometheus Queries for Capacity Analysis
Prometheus Alerting Rules for Capacity
Forecasting Script
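A forecasting script of this kind can be quite small. The following is a minimal sketch rather than a definitive implementation: it assumes linear growth, takes `(unix_timestamp, utilization_ratio)` pairs such as those returned by a Prometheus range query, and uses hypothetical function names.

```python
def linear_fit(points):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in points)
    if var_x == 0:
        raise ValueError("all samples share the same timestamp")
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = cov_xy / var_x
    return slope, mean_y - slope * mean_x


def days_until(points, threshold=1.0):
    """Days from the last sample until the fitted line reaches `threshold`.

    points: (unix_timestamp, utilization_ratio) pairs.
    Returns None when the trend is flat or falling, 0.0 if already reached.
    """
    slope, intercept = linear_fit(points)
    if slope <= 0:
        return None
    crossing = (threshold - intercept) / slope
    last_ts = max(x for x, _ in points)
    return max(0.0, (crossing - last_ts) / 86400)


if __name__ == "__main__":
    # Synthetic data: 30 daily samples growing by one percentage point per day
    history = [(day * 86400, 0.50 + 0.01 * day) for day in range(30)]
    print(days_until(history, threshold=1.0))
```

With the synthetic history above, utilization reaches 79 percent on the last sample and the fitted line crosses 100 percent about 21 days later. Seasonal or exponential workloads would need a different model than this straight-line fit.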
Pros and Cons
Pros:
- Uses real historical metrics – most accurate forecasting method
- Customizable forecasting models (linear, exponential, seasonal)
- Integrates with existing Prometheus/Grafana stack
- Real-time alerting when capacity thresholds are approached
- No additional software beyond Prometheus and Python
Cons:
- Requires Prometheus to be already deployed and collecting metrics
- Forecasting accuracy depends on historical data quality and quantity
- Custom scripting needed (no out-of-the-box solution)
- Does not model K8s scheduling behavior (unlike Open Simulator)
Comparison Table
| Feature | Open Simulator | Cloud Custodian | K8s Resource Forecasting |
|---|---|---|---|
| Approach | Cluster simulation | Policy-driven analysis | Historical metrics forecasting |
| K8s Scheduling Model | Yes | No | No |
| Multi-Cloud | No (Alibaba-focused) | Yes (AWS, GCP, Azure, K8s) | K8s only (with Prometheus) |
| What-If Scenarios | Yes | Limited | Limited |
| Historical Data | No | Cloud provider metrics | Full Prometheus history |
| Automation | Manual simulation | Policy enforcement | Alert-based |
| Cost Estimation | Yes | Yes | Manual calculation |
| Bare-Metal Support | Limited | No | Yes (via node_exporter) |
| Active Development | Slow (2023) | Yes (active) | Custom (your scripts) |
| GitHub Stars | 267+ | 11,000+ | N/A (custom) |
| Best For | K8s cluster sizing | Multi-cloud optimization | K8s capacity forecasting |
Why Self-Host Capacity Planning?
Capacity planning tools that run in your infrastructure have several advantages over SaaS alternatives. First, they have direct access to your resource metrics without requiring API keys or data sharing with third parties. Second, they can model your specific infrastructure topology, including private networks, on-premises hardware, and hybrid cloud deployments that SaaS tools cannot see. Third, for regulated industries such as finance, healthcare, and government, keeping capacity data on-premises is often a compliance requirement.
When combined with infrastructure drift detection tools, capacity planning becomes part of a broader infrastructure governance strategy. Understanding your Kubernetes batch scheduling patterns helps predict peak resource demands, while network bandwidth monitoring reveals whether network capacity is keeping pace with compute growth.
Choosing the Right Capacity Planning Approach
For pure Kubernetes environments, Open Simulator provides the most accurate placement predictions because it models the actual K8s scheduler. If you need to understand how a new deployment will affect pod placement across nodes, simulation is the most reliable approach.
For multi-cloud environments, Cloud Custodian offers the broadest coverage. Its ability to analyze resources across AWS, GCP, and Azure, plus Kubernetes, makes it the right choice for organizations with hybrid infrastructure.
For K8s clusters with Prometheus already deployed, custom resource forecasting provides the most accurate growth predictions because it uses real historical data rather than simulations or snapshots. Forecasting accuracy improves with more historical data, making this approach increasingly valuable over time.
FAQ
What is the difference between capacity planning and capacity management?
Capacity planning is forward-looking – it predicts future resource needs based on growth trends and planned changes. Capacity management is present-focused – it monitors current resource utilization and ensures services have enough capacity right now. Capacity planning uses historical data to forecast while capacity management uses real-time metrics to alert.
How far in advance should I plan capacity?
The planning horizon depends on your procurement cycle. For cloud environments with on-demand scaling, 30 to 90 days is typical. For on-premises infrastructure that requires hardware procurement and deployment, 6 to 12 months is more realistic. Most organizations benefit from maintaining both a short-term 30-day tactical plan and a long-term 12-month strategic plan.
Can Open Simulator work with non-Alibaba Kubernetes clusters?
Yes. Open Simulator reads Kubernetes cluster state via the standard kubeconfig file and the K8s API. While it was developed by Alibaba and has some Alibaba Cloud-specific features, the core simulation engine works with any standards-compliant Kubernetes cluster.
How accurate is Prometheus-based capacity forecasting?
Forecasting accuracy depends on the quality and quantity of historical data. With 30 or more days of continuous metrics, linear regression can predict resource exhaustion to within 10 to 15 percent for stable workloads. For seasonal workloads such as e-commerce with holiday spikes, you need at least 12 months of data for accurate seasonal decomposition. Sudden workload changes, such as new product launches or viral traffic, cannot be predicted by any method based on historical data.
Does Cloud Custodian work with on-premises infrastructure?
No. Cloud Custodian connects to cloud provider APIs (AWS, GCP, Azure) and Kubernetes clusters. It does not support bare-metal servers, VMware, or other on-premises virtualization platforms. For on-premises capacity planning, Open Simulator or Prometheus-based forecasting is a better choice.
How do I set up capacity alerts before resources are exhausted?
Set alerting thresholds at 70 to 80 percent utilization for CPU, memory, and disk, not at 90 percent or higher. The 70 percent threshold gives you time to provision additional resources before hitting critical levels. For Kubernetes pod capacity, alert at 80 percent of node pod limits, because scheduling becomes increasingly difficult as nodes approach their pod capacity ceiling.
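The value of a lower threshold shows up when it is converted into lead time. The following worked sketch assumes utilization grows by a constant amount per day; the function name and the growth rate are hypothetical.

```python
def lead_time_days(alert_at, growth_per_day, limit=1.0):
    """Days between an alert firing at `alert_at` and utilization hitting `limit`.

    All values are ratios (0.7 means 70 percent); growth is assumed constant.
    """
    if growth_per_day <= 0:
        raise ValueError("growth_per_day must be positive")
    return max(0.0, (limit - alert_at) / growth_per_day)


if __name__ == "__main__":
    growth = 0.005  # half a percentage point of utilization per day
    for threshold in (0.70, 0.80, 0.90):
        print(threshold, round(lead_time_days(threshold, growth), 1))
```

At that growth rate, an alert at 70 percent leaves about 60 days to react, while an alert at 90 percent leaves about 20, which may be shorter than a hardware procurement cycle.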
What metrics should I track for capacity planning?
The essential metrics are CPU utilization per node and per namespace, memory utilization, disk I/O and capacity, network bandwidth, pod count per node, and request-to-limit ratios. For database workloads, also track connection count, query latency, and replication lag. For storage, track IOPS, throughput, and latency percentiles such as p95 and p99.