Finding the right hyperparameters for a model can make the difference between mediocre and state-of-the-art performance. Manual tuning is slow and error-prone. Automated hyperparameter optimization (HPO) frameworks systematically search the parameter space, finding better configurations with fewer training runs.
This guide compares the three leading open-source HPO frameworks for self-hosted deployment: Optuna (Preferred Networks), Ray Tune (Anyscale/Ray project), and Hyperopt (Montreal Institute for Learning Algorithms). We’ll cover search algorithms, distributed execution, and Docker deployment configurations, then help you choose the right tool for your training infrastructure.
For the training frameworks that these HPO tools optimize, see our distributed training comparison. If you need to track HPO experiments alongside training runs, our experiment tracking guide covers the best tools. And for monitoring GPU utilization during HPO sweeps, check our GPU monitoring guide.
Why Automated Hyperparameter Optimization
Hyperparameters — learning rates, batch sizes, network depths, regularization strengths — control how a model learns. Getting them right is critical, but the search space is often enormous. A neural network with 10 tunable hyperparameters, each with just 5 candidate values, has nearly 10 million possible combinations.
Automated HPO frameworks address this challenge through:
- Intelligent search — Bayesian optimization and tree-structured Parzen estimators find good configurations faster than grid search
- Early stopping — Pruning bad trials mid-training saves compute resources
- Parallel execution — Running multiple configurations simultaneously across GPUs or nodes
- Reproducibility — Tracking every trial’s parameters, metrics, and outcomes
Optuna: Tree-Structured Bayesian Optimization
Optuna is a lightweight, Python-native HPO framework that uses the Tree-structured Parzen Estimator (TPE) as its default sampler for efficient parameter search. Its define-by-run API makes it easy to express complex search spaces with conditional parameters.
Architecture
Optuna uses a central Study object that manages trials. Each trial samples hyperparameters using TPE, which builds probability distributions over good and bad parameter configurations and samples from the better region. Optuna also supports CMA-ES (Covariance Matrix Adaptation Evolution Strategy) for continuous parameter spaces.
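A minimal sketch of the Study/trial flow, assuming a toy objective in place of a real training loop and SQLite as an illustrative storage URL (PostgreSQL or MySQL would be used when multiple workers share a study):

```python
import optuna

def objective(trial):
    # Define-by-run: hyperparameters are sampled inside the objective itself.
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
    dropout = trial.suggest_float("dropout", 0.0, 0.5)
    # Placeholder score; a real objective would train a model and return
    # its validation loss.
    return (lr - 1e-3) ** 2 + 0.1 * dropout

study = optuna.create_study(
    study_name="example-study",
    direction="minimize",
    sampler=optuna.samplers.TPESampler(),   # default; CmaEsSampler() suits continuous spaces
    pruner=optuna.pruners.MedianPruner(),
    storage="sqlite:///optuna.db",          # a shared RDB lets parallel workers join the study
    load_if_exists=True,
)
study.optimize(objective, n_trials=100)
print(study.best_params)
```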
Docker Deployment
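A minimal docker-compose sketch, assuming PostgreSQL as the study storage and the official optuna-dashboard image (which is assumed to ship a PostgreSQL driver); image tags, credentials, and the worker setup are placeholders to adapt:

```yaml
services:
  postgres:
    image: postgres:16
    environment:
      POSTGRES_USER: optuna
      POSTGRES_PASSWORD: optuna      # replace with a real secret
      POSTGRES_DB: optuna
    volumes:
      - optuna-data:/var/lib/postgresql/data

  dashboard:
    image: ghcr.io/optuna/optuna-dashboard:latest
    command: postgresql+psycopg2://optuna:optuna@postgres:5432/optuna
    ports:
      - "8080:8080"
    depends_on:
      - postgres

  # Worker containers run your study script against the same storage URL;
  # scale them out to parallelize trials.

volumes:
  optuna-data:
```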
Search Space Definition
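A sketch of a conditional search space using the define-by-run API; the parameter names and ranges are arbitrary examples:

```python
import optuna

def objective(trial):
    # The optimizer choice gates which hyperparameters exist in this trial.
    optimizer = trial.suggest_categorical("optimizer", ["adam", "sgd"])
    if optimizer == "adam":
        lr = trial.suggest_float("adam_lr", 1e-5, 1e-2, log=True)
    else:
        lr = trial.suggest_float("sgd_lr", 1e-4, 1e-1, log=True)
        momentum = trial.suggest_float("momentum", 0.5, 0.99)  # only sampled for SGD
    n_layers = trial.suggest_int("n_layers", 1, 5)
    batch_size = trial.suggest_categorical("batch_size", [16, 32, 64, 128])
    # Placeholder return value; substitute a real validation metric.
    return (lr - 1e-3) ** 2 + n_layers * 1e-4

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=20)
```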
Key Strengths
- Define-by-run API for dynamic search spaces
- Built-in pruning (Median, Hyperband, Patient, Percentile)
- Web dashboard for real-time visualization
- PostgreSQL/MySQL/SQLite storage backends
- Multi-objective optimization
- Integration with PyTorch, TensorFlow, scikit-learn, XGBoost, LightGBM
Ray Tune: Distributed Hyperparameter Search at Scale
Ray Tune is part of the Ray distributed computing ecosystem. It provides scalable hyperparameter search with support for multiple search algorithms, schedulers, and seamless integration with the broader Ray ecosystem (data processing, serving, reinforcement learning).
Architecture
Ray Tune runs on top of the Ray core distributed runtime. It schedules trials across a cluster of nodes, giving each trial its own CPU or GPU allocation. Tune integrates with external search algorithms (Optuna, Hyperopt, Ax, Nevergrad, BOHB) as drop-in samplers.
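A minimal sketch of running Tune against an existing Ray cluster; the trainable is a stand-in for a real training function, and the per-trial resource numbers are illustrative:

```python
import ray
from ray import tune
from ray.tune.schedulers import ASHAScheduler

def trainable(config):
    # Placeholder for a training loop; return final metrics as a dict.
    loss = (config["lr"] - 1e-3) ** 2 + 0.1 * config["dropout"]
    return {"loss": loss}

ray.init(address="auto")  # connect to a cluster started with `ray start --head`; omit for local mode

tuner = tune.Tuner(
    tune.with_resources(trainable, {"cpu": 2, "gpu": 1}),  # resources reserved per trial
    param_space={
        "lr": tune.loguniform(1e-5, 1e-1),
        "dropout": tune.uniform(0.0, 0.5),
    },
    tune_config=tune.TuneConfig(
        metric="loss",
        mode="min",
        num_samples=100,
        scheduler=ASHAScheduler(),  # early-stops underperforming trials
    ),
)
results = tuner.fit()
print(results.get_best_result().config)
```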
Docker Deployment
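A minimal docker-compose sketch for a head node plus workers, assuming the rayproject/ray image; the image tag, replica count, and shared-memory size are illustrative:

```yaml
services:
  ray-head:
    image: rayproject/ray:2.9.3
    command: ray start --head --port=6379 --dashboard-host=0.0.0.0 --block
    ports:
      - "8265:8265"   # Ray Dashboard
      - "6379:6379"   # cluster port for workers and clients
    shm_size: 8gb

  ray-worker:
    image: rayproject/ray:2.9.3
    command: ray start --address=ray-head:6379 --block
    depends_on:
      - ray-head
    shm_size: 8gb
    deploy:
      replicas: 2     # scale out for more concurrent trials
```

GPU workers additionally need the NVIDIA container runtime and a GPU-enabled image variant.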
Search Configuration
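A sketch of the dict-based search space (tune.*) with Optuna plugged in as the search backend via OptunaSearch (requires the optuna package); the trainable and parameter ranges are placeholders:

```python
from ray import tune
from ray.tune.search.optuna import OptunaSearch

def trainable(config):
    # Placeholder objective; a real trainable would train and evaluate a model.
    score = (config["lr"] - 1e-3) ** 2 + 1e-4 * config["batch_size"]
    return {"loss": score}

search_space = {
    "lr": tune.loguniform(1e-5, 1e-1),
    "batch_size": tune.choice([16, 32, 64, 128]),
    "weight_decay": tune.uniform(0.0, 0.1),
}

tuner = tune.Tuner(
    trainable,
    param_space=search_space,
    tune_config=tune.TuneConfig(
        search_alg=OptunaSearch(metric="loss", mode="min"),  # TPE under the hood
        num_samples=200,
    ),
)
results = tuner.fit()
```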
Key Strengths
- Massive scalability — thousands of trials across hundreds of nodes
- Built-in cluster management with Ray autoscaler
- Multiple scheduler algorithms (HyperBand, ASHA, PBT, FIFO)
- Integrates with Optuna, Hyperopt, Ax, Nevergrad, BOHB as search backends
- Population-Based Training (PBT) for dynamic hyperparameter adjustment
- Ray Dashboard for real-time monitoring
- Native integration with Ray Serve for model serving
Hyperopt: Bayesian Optimization Foundation
Hyperopt is one of the earliest HPO frameworks, implementing Sequential Model-Based Optimization (SMBO) with Tree-Structured Parzen Estimators. It provides the foundational algorithms that many newer frameworks build upon.
Architecture
Hyperopt exposes an fmin interface: you define an objective function and a search space using its domain-specific language (DSL). The TPE algorithm splits past trials into good and bad groups, maintains a distribution over each, and samples candidates that maximize the ratio between the two distributions.
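A minimal fmin sketch; the objective is a stand-in for a real training run, and the log-space bounds approximate a 1e-5 to 1e-1 learning-rate range:

```python
import math
from hyperopt import fmin, tpe, hp, Trials

def objective(params):
    # Placeholder loss; a real objective would train a model with these params
    # and return its validation loss (or a dict with "loss" and "status").
    return (params["lr"] - 1e-3) ** 2 + 0.1 * params["dropout"]

space = {
    # hp.loguniform takes bounds in log space.
    "lr": hp.loguniform("lr", math.log(1e-5), math.log(1e-1)),
    "dropout": hp.uniform("dropout", 0.0, 0.5),
}

trials = Trials()
best = fmin(fn=objective, space=space, algo=tpe.suggest,
            max_evals=100, trials=trials)
print(best)
```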
Docker Deployment
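A minimal docker-compose sketch for MongoDB-backed distributed search; the mongo tag, database name, and the on-the-fly pip install in the worker image are illustrative shortcuts:

```yaml
services:
  mongo:
    image: mongo:7
    ports:
      - "27017:27017"
    volumes:
      - hyperopt-data:/data/db

  worker:
    image: python:3.11-slim
    command: >
      sh -c "pip install hyperopt pymongo &&
             hyperopt-mongo-worker --mongo=mongo:27017/hyperopt_db --poll-interval=1"
    depends_on:
      - mongo
    deploy:
      replicas: 4     # each worker pulls trials from the shared Mongo queue

volumes:
  hyperopt-data:
```

The driver process then calls fmin with `trials=MongoTrials("mongo://mongo:27017/hyperopt_db/jobs", exp_key="my_exp")` so that trial evaluation is farmed out to the workers.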
Search Space Definition
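A sketch of the hp.* DSL, including a nested hp.choice for conditional parameters; the names and ranges are arbitrary examples:

```python
import math
from hyperopt import hp
from hyperopt.pyll.stochastic import sample

space = {
    # Nested choice: each optimizer branch carries its own parameters.
    "optimizer": hp.choice("optimizer", [
        {"type": "adam", "lr": hp.loguniform("adam_lr", math.log(1e-5), math.log(1e-2))},
        {"type": "sgd",  "lr": hp.loguniform("sgd_lr", math.log(1e-4), math.log(1e-1)),
         "momentum": hp.uniform("momentum", 0.5, 0.99)},
    ]),
    "batch_size": hp.choice("batch_size", [16, 32, 64, 128]),
    "n_layers": hp.quniform("n_layers", 1, 5, 1),   # integer-valued (returned as a float)
}

# Draw one random configuration to sanity-check the space definition.
print(sample(space))
```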
Key Strengths
- Lightweight, minimal dependencies
- TPE algorithm — the foundation for Optuna’s default sampler and for Ray Tune’s Hyperopt search backend
- MongoDB-based distributed execution
- Simple search space DSL
- Random search baseline for comparison
- Integration with scikit-learn, Keras, PyTorch
Comparison Table
| Feature | Optuna | Ray Tune | Hyperopt |
|---|---|---|---|
| Search Algorithms | TPE, CMA-ES, Grid, Random | TPE (via Optuna), Hyperopt, Ax, Nevergrad, BOHB, Random, Grid | TPE, Random |
| Pruning/Scheduling | Median, Hyperband, Patient, Percentile | ASHA, HyperBand, PBT, FIFO, Median | None (manual) |
| Distributed Execution | RDB-based (PostgreSQL, MySQL) | Ray cluster (native distributed) | MongoDB-based |
| Web Dashboard | Optuna Dashboard | Ray Dashboard (built-in) | None |
| Multi-Objective | Yes (NSGA-II, MOTPE) | Yes | No |
| Search Space API | Define-by-run (Python) | Dict-based (tune.*) | DSL (hp.*) |
| Framework Integration | PyTorch, TF, sklearn, XGBoost, LightGBM | PyTorch, TF, sklearn, XGBoost, JAX, RLlib | PyTorch, TF, Keras, sklearn |
| Population-Based Training | No | Yes (PBT) | No |
| Cold Start | Random init | Random init | Random init |
| Ease of Setup | Simple (pip + DB) | Moderate (Ray cluster) | Moderate (MongoDB) |
| Scalability | Medium (DB-bounded) | High (Ray cluster) | Medium (Mongo-bounded) |
| Active Maintainer | Preferred Networks | Anyscale / Ray community | MILA / community |
| GitHub Stars | 14,100+ | 42,400+ (Ray) | 7,600+ |
| Last Update | 2026-05 | 2026-05 | 2026-03 |
Choosing the Right HPO Framework
Choose Optuna when:
- You want the best balance of features and simplicity
- You need a web dashboard for monitoring trials
- Your search space has conditional parameters (define-by-run API excels here)
- You’re running on a single node or small cluster with a shared database
- You need multi-objective optimization
Choose Ray Tune when:
- You need to run hundreds or thousands of trials across a GPU cluster
- You want Population-Based Training for dynamic hyperparameter adjustment
- You’re already using Ray for data processing or model serving
- You need the flexibility to swap between multiple search algorithms
- You want built-in autoscaling for cloud or on-premise clusters
Choose Hyperopt when:
- You want a lightweight dependency with minimal setup
- You’re implementing custom search algorithms on top of TPE
- You need a proven, battle-tested algorithm as a baseline
- Your infrastructure already runs MongoDB for other services
Why Self-Host Hyperparameter Optimization
Cost Efficiency: HPO runs many trials, each consuming GPU hours. Cloud GPU costs multiply quickly — 200 trials at 2 hours each on an A100 instance can cost thousands of dollars. Self-hosted clusters amortize hardware costs across thousands of experiments.
Privacy: Hyperparameter search results reveal information about your models and datasets. Self-hosted HPO keeps trial histories, configurations, and performance metrics within your infrastructure.
Unlimited Trials: Cloud providers impose quotas on GPU usage and concurrent instances. Self-hosted clusters let you run HPO sweeps continuously without hitting rate limits or quota walls.
For the broader ML pipeline that HPO fits into, see our ML pipeline orchestration guide for automating the full training-to-deployment workflow, and our feature store comparison for feature engineering before HPO.
FAQ
What is the difference between grid search, random search, and Bayesian optimization?
Grid search exhaustively tries every combination of hyperparameters — efficient only when the search space is tiny. Random search samples randomly from the parameter space and often outperforms grid search with fewer trials because it doesn’t waste computation on irrelevant parameters. Bayesian optimization (TPE, used by all three frameworks) builds a probabilistic model of the objective function and uses it to select the most promising configurations, typically finding better results with fewer trials than random search.
How many trials do I need for effective hyperparameter optimization?
As a starting point, 50-100 trials is usually enough to find a good configuration for most models. For complex models with many hyperparameters, 200-500 trials may be needed. The exact number depends on the search space size and the sensitivity of your model to hyperparameter changes. Use pruning to stop unpromising trials early and save compute.
Can I resume an HPO study after the cluster restarts?
Yes — all three frameworks support persistence. Optuna uses RDB storage (PostgreSQL, MySQL, SQLite) that persists across restarts. Ray Tune saves checkpoints to disk and can resume from the latest state. Hyperopt with MongoDB storage maintains trial state in the database.
Should I use the same HPO framework for all my projects?
Not necessarily. For single-node experiments with quick iterations, Optuna’s simplicity is ideal. For large-scale cluster searches, Ray Tune’s distributed execution is worth the setup complexity. Hyperopt works well as a lightweight baseline or when you need to implement custom search algorithms.
How does pruning save compute during HPO?
Pruning stops trials early when their intermediate metrics compare poorly with those of previous trials. For example, under median pruning, if trial #50’s validation loss after 5 epochs is worse than the median validation loss that previous trials reported at epoch 5, it gets terminated. This can save 30-60% of compute by not wasting GPU hours on configurations that clearly won’t converge to good results.
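As a rough sketch of how this hooks into a training loop, here is Optuna’s report/should_prune pattern; the per-epoch losses are placeholders for real validation results:

```python
import optuna

def objective(trial):
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
    val_loss = float("inf")
    for epoch in range(20):
        # Placeholder for one epoch of training + validation.
        val_loss = (lr - 1e-3) ** 2 + 1.0 / (epoch + 1)
        trial.report(val_loss, step=epoch)
        if trial.should_prune():        # worse than the median of prior trials at this step
            raise optuna.TrialPruned()
    return val_loss

study = optuna.create_study(
    direction="minimize",
    pruner=optuna.pruners.MedianPruner(n_warmup_steps=5),  # don't prune before epoch 5
)
study.optimize(objective, n_trials=50)
```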
Can I use HPO with distributed training frameworks?
Yes — HPO and distributed training are complementary. Each HPO trial can itself use distributed training (e.g., DeepSpeed or FSDP across multiple GPUs). However, this multiplies resource requirements. A common pattern is to use single-GPU trials for HPO to find good hyperparameters, then scale up to distributed training for the final model training run.
What is Population-Based Training (PBT) and when should I use it?
PBT runs multiple trials in parallel and periodically replaces underperforming configurations with perturbed copies of better ones. Unlike standard HPO where each trial runs independently, PBT allows trials to learn from each other. It’s particularly useful when you want to find not just good hyperparameters but also good learning rate schedules or training dynamics. Only Ray Tune supports PBT natively.
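A sketch of a PBT scheduler configuration in Ray Tune; the mutation ranges and perturbation interval are illustrative, and the scheduler is passed to tune.Tuner via TuneConfig:

```python
from ray import tune
from ray.tune.schedulers import PopulationBasedTraining

pbt = PopulationBasedTraining(
    time_attr="training_iteration",
    perturbation_interval=5,               # exploit/explore every 5 reported iterations
    hyperparam_mutations={
        "lr": tune.loguniform(1e-5, 1e-1),
        "momentum": [0.8, 0.9, 0.95, 0.99],
    },
)

# Used as: tune.Tuner(trainable, tune_config=tune.TuneConfig(
#     metric="loss", mode="min", num_samples=8, scheduler=pbt))
```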