← Back to posts
comparison guide self-hosted · · 10 min read

BentoML vs Seldon Core vs Triton vs Ray Serve: Best ML Model Serving Platforms 2026

Compare BentoML, Seldon Core, NVIDIA Triton, and Ray Serve for self-hosted machine learning model serving. Feature matrix, Docker deployment guides, and performance recommendations.

OS
Editorial Team

Deploying trained machine learning models into production is one of the most critical steps in any ML pipeline. While cloud platforms offer managed inference services, self-hosting your model serving infrastructure gives you full control over costs, data privacy, latency, and scalability.

In this guide, we compare four leading open-source model serving platforms — BentoML, Seldon Core, NVIDIA Triton Inference Server, and Ray Serve — to help you choose the right solution for your self-hosted deployment.

For related reading, see our ML experiment tracking comparison and ML feature store guide for a complete MLOps stack overview.

Why Self-Host Your Model Serving Infrastructure?

Running your own model serving platform offers several advantages over managed cloud alternatives:

  • Cost control: No per-inference API charges or GPU markup. Your hardware costs remain fixed regardless of request volume.
  • Data privacy: Sensitive data never leaves your infrastructure, critical for healthcare, finance, and regulated industries.
  • Latency optimization: Deploy inference servers close to your application servers or end users, eliminating cross-cloud network hops.
  • No vendor lock-in: Open-source serving frameworks let you swap models, frameworks, or hardware without rewriting your deployment code.
  • Full observability: Instrument metrics, logging, and tracing with your existing monitoring stack (Prometheus, Grafana, Jaeger).

BentoML — Unified Model Serving Framework

BentoML (8,593 ★) is a Python-first model serving framework that simplifies the process of packaging and deploying models from any framework. It provides a unified API layer that abstracts away framework-specific complexities.

Key Features

  • Framework agnostic: Supports PyTorch, TensorFlow, scikit-learn, XGBoost, Hugging Face, and more through a single API.
  • Adaptive batching: Automatically groups incoming requests to maximize GPU utilization without increasing per-request latency.
  • Model composition: Chain multiple models into pipelines (e.g., preprocessing → inference → postprocessing) as a single service.
  • Built-in API server: Ships with an HTTP/REST and gRPC server out of the box — no reverse proxy required for basic deployments.
  • Bento packaging: Models are bundled with their dependencies into versioned “Bentos” that are reproducible and portable.

Quick Start

1
pip install bentoml

Create a service definition:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
import bentoml
from bentoml.io import JSON

# Load a pre-trained model from the BentoML model store
model_ref = bentoml.models.get("my_sklearn_model:latest")

@bentoml.service(resources={"cpu": "2"})
class MyModelService:
    def __init__(self):
        self.model = model_ref.load_model()

    @bentoml.api
    def predict(self, input: JSON) -> JSON:
        features = input.data()["features"]
        result = self.model.predict([features])
        return JSON({"prediction": result.tolist()})

Run the service locally:

1
bentoml serve service:MyModelService --port 3000

Docker Deployment

BentoML can containerize any service into a production-ready Docker image:

1
bentoml containerize service:MyModelService -t my-model:latest

Or use a manual Dockerfile:

1
2
3
4
5
6
FROM python:3.11-slim
RUN pip install bentoml scikit-learn numpy
COPY . /app
WORKDIR /app
EXPOSE 3000
CMD ["bentoml", "serve", "service:MyModelService", "--host", "0.0.0.0"]
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
# docker-compose.yml
version: "3.8"
services:
  bentoml-server:
    image: my-model:latest
    ports:
      - "3000:3000"
    environment:
      - BENTOML_NUM_THREADS=4
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

Seldon Core — Kubernetes-Native MLOps Platform

Seldon Core (4,745 ★) is a Kubernetes-native platform for deploying machine learning models at scale. It wraps your models in a standardized microservice architecture and provides advanced orchestration features.

Key Features

  • Kubernetes-native: Deploys as custom resources (SeldonDeployments) managed by Kubernetes operators.
  • Advanced routing: Supports canary deployments, A/B testing, and shadow mode for model comparisons.
  • Multi-framework support: Python wrappers, Java, R, and custom Docker images through the Seldon Core microservice protocol.
  • Built-in monitoring: Exports Prometheus metrics for request counts, latency, and payload distributions.
  • Autoscaling: Integrates with Kubernetes HPA and KEDA for request-driven scaling.

Deployment Example

Define a SeldonDeployment:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: my-model
  namespace: ml-system
spec:
  predictors:
  - name: default
    graph:
      name: classifier
      type: MODEL
      modelUri: gs://my-bucket/sklearn-model
      parameters:
        - name: method
          type: STRING
          value: predict
    replicas: 2
    componentSpecs:
    - spec:
        containers:
        - name: classifier
          resources:
            requests:
              memory: "512Mi"
              cpu: "500m"
            limits:
              memory: "1Gi"
              nvidia.com/gpu: 1

Apply to your cluster:

1
2
kubectl apply -f seldon-deployment.yaml
kubectl get sdep my-model -n ml-system

When to Choose Seldon Core

Seldon Core excels when you already run Kubernetes and need enterprise-grade ML deployment features: progressive delivery, multi-model routing, and deep Kubernetes integration. It is the most complex option to set up but offers the richest feature set for production MLOps.

NVIDIA Triton Inference Server — High-Performance GPU Serving

NVIDIA Triton (10,593 ★) is a high-performance inference server optimized for GPU workloads. It supports multiple model frameworks simultaneously and provides advanced features like dynamic batching and concurrent model execution.

Key Features

  • Multi-framework support: Run PyTorch, TensorFlow, TensorRT, ONNX Runtime, and OpenVINO models side-by-side in a single server.
  • Concurrent model execution: Serve multiple models or model versions simultaneously on the same GPU.
  • Dynamic batching: Configurable batching policies that group requests to maximize GPU throughput.
  • Model ensemble: Define directed acyclic graphs (DAGs) of model steps that execute in a single server round-trip.
  • Hardware optimization: Deep NVIDIA GPU optimizations with TensorRT integration for maximum inference throughput.

Docker Deployment

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
# docker-compose.yml
version: "3.8"
services:
  triton-server:
    image: nvcr.io/nvidia/tritonserver:24.05-py3
    ports:
      - "8000:8000"  # HTTP
      - "8001:8001"  # gRPC
      - "8002:8002"  # Metrics
    volumes:
      - ./model_repository:/models
    command: >
      tritonserver
      --model-repository=/models
      --strict-model-config=false
      --http-port=8000
      --grpc-port=8001
      --metrics-port=8002
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

Model repository structure:

1
2
3
4
5
6
7
8
9
model_repository/
├── my_pytorch_model/
│   ├── 1/
│   │   └── model.pt
│   └── config.pbtxt
└── my_onnx_model/
    ├── 1/
    │   └── model.onnx
    └── config.pbtxt

Example config.pbtxt:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
name: "my_pytorch_model"
platform: "pytorch_libtorch"
max_batch_size: 128
input [
  {
    name: "INPUT__0"
    data_type: TYPE_FP32
    dims: [ 3, 224, 224 ]
  }
]
output [
  {
    name: "OUTPUT__0"
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]
dynamic_batching {
  preferred_batch_size: [ 16, 32, 64 ]
  max_queue_delay_microseconds: 100
}

When to Choose Triton

Triton is the best choice when raw GPU inference performance is your top priority. It is particularly strong for computer vision, speech recognition, and other GPU-heavy workloads where TensorRT optimization and concurrent model execution deliver measurable throughput gains.

Ray Serve — Scalable Distributed Model Serving

Ray Serve (part of the Ray ecosystem, 42,262 ★) is a scalable model serving library built on top of the Ray distributed computing framework. It enables you to deploy models across multiple machines with automatic load balancing and scaling.

Key Features

  • Distributed by design: Deploy models across a Ray cluster spanning multiple nodes with automatic request routing.
  • Model composition: Build complex pipelines with multiple models, each independently scalable.
  • Python-native: Write serving logic in pure Python with familiar async/await patterns.
  • Framework agnostic: Works with any Python ML framework — PyTorch, TensorFlow, scikit-learn, and more.
  • Auto-scaling: Built-in scaling policies that add or remove replicas based on queue depth and latency targets.

Quick Start

1
pip install "ray[serve]"

Define and deploy a service:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
from ray import serve
import pickle
import requests

@serve.deployment(num_replicas=2, ray_actor_options={"num_cpus": 1})
class ModelDeployment:
    def __init__(self, model_path: str):
        with open(model_path, "rb") as f:
            self.model = pickle.load(f)

    def __call__(self, request):
        data = request.query_params.get("input")
        result = self.model.predict([float(data)])
        return {"prediction": float(result[0])}

# Deploy
serve.run(ModelDeployment.bind(model_path="/models/model.pkl"))

Docker Compose Deployment

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
# docker-compose.yml
version: "3.8"
services:
  ray-head:
    image: rayproject/ray:2.31.0-py311
    ports:
      - "8265:8265"   # Ray Dashboard
      - "10001:10001" # Client Port
      - "8000:8000"   # Serve HTTP
    command: >
      ray start --head
      --port=6379
      --dashboard-host=0.0.0.0
      --block
    volumes:
      - ./models:/models
      - ./deploy.py:/deploy.py
    deploy:
      resources:
        limits:
          memory: 4G

  ray-worker:
    image: rayproject/ray:2.31.0-py311
    command: >
      ray start
      --address=ray-head:6379
      --num-cpus=4
      --block
    deploy:
      replicas: 2
      resources:
        limits:
          memory: 4G

Deploy the model:

1
2
3
docker compose up -d
# Then exec into the head node:
docker compose exec ray-head python /deploy.py

Feature Comparison

FeatureBentoMLSeldon CoreTritonRay Serve
Primary environmentStandalone / DockerKubernetesDocker / Bare metalRay cluster
Framework supportAny (Python)Any (wrappers)PT, TF, TRT, ONNXAny (Python)
GPU supportYesYes (via K8s)Excellent (native)Yes
Dynamic batchingAdaptivePluginConfigurableCustom
Model compositionNative (pipelines)Multi-step graphsEnsemble DAGNative (pipelines)
AutoscalingManualK8s HPA/KEDAManualBuilt-in
Canary deploymentsManualNativeManualManual
API protocolsREST, gRPCREST, gRPCREST, gRPC, HTTP/2REST
MonitoringPrometheusPrometheusPrometheusRay Dashboard
Multi-node servingNoYes (K8s)No (single server)Yes (native)
Setup complexityLowHighMediumMedium
Best forQuick deploymentK8s environmentsGPU performanceDistributed serving

Which Platform Should You Choose?

The right choice depends on your infrastructure and workload:

  • Choose BentoML if you want the fastest path from trained model to production API. Its Python-first design and built-in server mean you can go from a model file to a running HTTP endpoint in minutes. Ideal for teams that need simplicity without sacrificing flexibility.

  • Choose Seldon Core if you are running Kubernetes and need enterprise-grade deployment features: canary releases, A/B testing, and deep integration with your cluster’s observability stack. It is the most feature-complete option but requires the most infrastructure.

  • Choose NVIDIA Triton if GPU inference throughput is your primary concern. Its multi-framework support, concurrent model execution, and TensorRT integration make it the performance leader for GPU-bound workloads.

  • Choose Ray Serve if you need to scale model serving across multiple machines. Its distributed architecture and integration with the broader Ray ecosystem (Ray Data, Ray Train) make it ideal for large-scale deployments.

FAQ

What is the difference between model training and model serving?

Model training is the process of teaching a model to recognize patterns in data, typically done offline with large datasets. Model serving (inference) is the process of using a trained model to make predictions on new data in real time. Serving requires low-latency, high-throughput infrastructure optimized for repeated prediction requests rather than batch computation.

Can I serve models from different frameworks on the same server?

Yes. Triton excels at this — it can run PyTorch, TensorFlow, ONNX, and TensorRT models simultaneously in a single server process. BentoML and Ray Serve also support multiple frameworks through their Python APIs. Seldon Core supports this through its standardized microservice wrapper pattern.

Do I need a GPU to self-host model serving?

Not necessarily. CPU-based serving works well for lightweight models like scikit-learn, decision trees, and small neural networks. GPU acceleration becomes essential for deep learning models (especially transformers and computer vision) where inference latency on CPU would be unacceptable for real-time applications. Triton and Ray Serve both support CPU-only deployments.

How do I handle model versioning in production?

All four platforms support deploying multiple model versions simultaneously. BentoML versions models through its model store. Triton uses versioned directories in the model repository. Seldon Core manages versions through SeldonDeployment manifests. Ray Serve tracks versions through deployment configurations. For end-to-end version tracking across the full ML lifecycle, see our data versioning guide.

Can I autoscale model serving based on traffic?

Seldon Core integrates with Kubernetes HPA and KEDA for automatic scaling based on request metrics. Ray Serve has built-in autoscaling that adjusts replica count based on queue depth. BentoML and Triton require external orchestration (Kubernetes, Docker Swarm) for autoscaling.

How do I secure my self-hosted model serving endpoints?

Place your inference server behind a reverse proxy (Traefik, Caddy, or Nginx) with TLS termination and authentication. Use API keys or OAuth2 tokens to control access. All four platforms expose metrics endpoints — restrict these to your internal monitoring network. For a comprehensive guide on securing self-hosted services, see our TLS termination proxy guide.

Advertise here
Advertise here