Running speech-to-text transcription in the cloud means sending your audio data to third-party servers, paying per-minute API fees, and accepting usage caps that can cripple high-volume workflows. Self-hosted transcription engines give you unlimited processing, complete data privacy, and zero per-call costs after the initial hardware investment.
In this guide, we compare the three leading open-source speech recognition engines you can run on your own infrastructure: OpenAI Whisper, whisper.cpp, and Vosk. Each takes a different approach to the transcription problem, and the right choice depends on your accuracy requirements, hardware constraints, and latency needs.
Why Self-Host Your Speech-to-Text Pipeline
Cloud transcription services like Google Cloud Speech-to-Text, AWS Transcribe, and Azure Speech charge $0.006–$0.024 per minute of audio. At 1,000 hours of monthly transcription, that’s $360–$1,440 every month. A self-hosted GPU server costs a one-time hardware investment and then processes unlimited audio for the price of electricity.
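The arithmetic behind those figures is easy to sanity-check; the rates and volume below are the ones quoted above:

```python
# Monthly cloud STT cost at per-minute pricing
hours_per_month = 1000
minutes = hours_per_month * 60
low_rate, high_rate = 0.006, 0.024  # $/min, typical cloud STT range

print(f"${minutes * low_rate:,.0f} to ${minutes * high_rate:,.0f} per month")
# prints "$360 to $1,440 per month"
```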
Beyond cost, self-hosting gives you:
- Full data privacy — audio never leaves your network, essential for healthcare, legal, and financial compliance
- No rate limits or API quotas — process 10 files or 10,000 files at the same speed
- Offline operation — transcribe on air-gapped systems or during network outages
- Custom vocabulary and domain tuning — add industry-specific terms that cloud services don’t recognize
- Predictable latency — no multi-tenant queueing, consistent response times
For related reading, see our self-hosted TTS engines guide for the reverse pipeline (text-to-speech), and our local AI tools comparison for running other models on-premise.
OpenAI Whisper — Highest Accuracy, GPU-Optimized
OpenAI released Whisper in September 2022 as an open-source speech recognition model trained on 680,000 hours of multilingual audio. It supports 99 languages and produces remarkably accurate transcripts even with heavy accents, background noise, and technical jargon.
| Metric | Value |
|---|---|
| GitHub Stars | 97,987 |
| Language | Python |
| Last Updated | April 2026 |
| Model Sizes | tiny (39M), base (74M), small (244M), medium (769M), large (1.5B) |
| GPU Support | CUDA, MPS (Apple Silicon) |
| License | MIT |
Whisper’s architecture is a sequence-to-sequence Transformer encoder-decoder. The large-v3 model achieves near-human accuracy on most benchmark datasets and handles code-switching between languages naturally. It produces segment-level timestamps and supports direct translation to English from any supported language.
Whisper Docker Deployment
The official repository doesn’t ship a Docker Compose file, but you can wrap it easily. Here’s a setup that containerizes the openai/whisper package with GPU passthrough:
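The repository itself doesn’t publish such an image, so treat the following as a sketch: the CUDA base tag, cache path, and mount points are assumptions you may need to adapt.

```dockerfile
# Dockerfile — wraps the openai-whisper pip package with ffmpeg on a CUDA base
FROM nvidia/cuda:12.1.1-cudnn8-runtime-ubuntu22.04
RUN apt-get update && apt-get install -y --no-install-recommends \
        python3-pip ffmpeg \
    && rm -rf /var/lib/apt/lists/*
RUN pip3 install --no-cache-dir -U openai-whisper
# Whisper caches model downloads under $XDG_CACHE_HOME/whisper;
# point it at a mountable path so weights survive container restarts
ENV XDG_CACHE_HOME=/models
ENTRYPOINT ["whisper"]
```

Build and run with GPU passthrough, mounting a models cache and your audio directory: `docker build -t whisper . && docker run --rm --gpus all -v "$PWD/models:/models" -v "$PWD/audio:/audio" whisper /audio/input.mp3 --model medium --output_dir /audio`.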
For an API server approach, the community-maintained whisper-server image exposes a REST endpoint:
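One widely used option is the community onerahmet/openai-whisper-asr-webservice image. The compose file below is a sketch based on that project’s documented environment variables; verify against its README before deploying:

```yaml
# docker-compose.yml
services:
  whisper-api:
    image: onerahmet/openai-whisper-asr-webservice:latest-gpu
    ports:
      - "9000:9000"
    environment:
      - ASR_MODEL=base          # tiny | base | small | medium | large-v3
      - ASR_ENGINE=openai_whisper
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    restart: unless-stopped
```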
Usage with curl:
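Assuming a whisper-asr-webservice-style API on port 9000 (the endpoint and field names follow that project’s convention):

```shell
# POST an audio file to the /asr endpoint; output=json includes segments
curl -s -X POST "http://localhost:9000/asr?task=transcribe&output=json" \
  -F "audio_file=@meeting.mp3"
```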
Best for: Maximum transcription accuracy, multilingual workloads, batch processing of recorded audio, and environments with dedicated GPU hardware.
whisper.cpp — Lightweight, CPU-First, Edge-Ready
whisper.cpp is a high-performance C/C++ port of OpenAI’s Whisper model by Georgi Gerganov. It eliminates the Python and PyTorch dependencies, running inference with a custom GGML tensor library that’s optimized for CPU execution and Apple Silicon.
| Metric | Value |
|---|---|
| GitHub Stars | 48,735 |
| Language | C/C++ |
| Last Updated | April 2026 |
| Model Sizes | All Whisper sizes (tiny through large), quantized to Q4/Q5/Q8 |
| Hardware | CPU, Apple Silicon, CUDA, Vulkan, SYCL |
| License | MIT |
The killer feature of whisper.cpp is quantization. While the original Whisper large model requires ~10 GB of VRAM, a Q4-quantized whisper.cpp model runs in under 3 GB of system RAM on a CPU — making it practical for Raspberry Pi 5, small VPS instances, and edge devices.
The built-in HTTP server (examples/server) provides a REST API out of the box, eliminating the need for a separate web framework.
whisper.cpp Docker Deployment
The project includes official Dockerfiles in .devops/. Here’s a complete setup with the HTTP server:
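A sketch using the project’s published container image; the server binary name and path inside the image have moved between releases, so check the tag you pull:

```yaml
# docker-compose.yml
services:
  whisper-cpp:
    image: ghcr.io/ggerganov/whisper.cpp:main
    command: ["whisper-server", "-m", "/models/ggml-base.en.bin",
              "--host", "0.0.0.0", "--port", "8080"]
    ports:
      - "8080:8080"
    volumes:
      - ./models:/models:ro
    restart: unless-stopped
```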
Download a model file before starting the container:
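Pre-converted GGML weights are published on Hugging Face under ggerganov/whisper.cpp; for example, to fetch the English base model:

```shell
mkdir -p models
curl -L -o models/ggml-base.en.bin \
  https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-base.en.bin
```

If you have the repository checked out, its `./models/download-ggml-model.sh base.en` script does the same thing.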
API usage:
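The server accepts multipart uploads on its /inference endpoint:

```shell
curl -s http://localhost:8080/inference \
  -F file="@meeting.wav" \
  -F response_format="json"
```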
For GPU acceleration on NVIDIA hardware, use the main-cuda.Dockerfile from .devops/:
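A sketch of the CUDA build and run; the model path and the transcription command passed into the container are placeholders:

```shell
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp
docker build -f .devops/main-cuda.Dockerfile -t whisper-cpp:cuda .
# --gpus all passes the NVIDIA runtime through to the container
docker run --rm --gpus all -v "$PWD/models:/models" whisper-cpp:cuda \
  "whisper-cli -m /models/ggml-base.en.bin -f /models/sample.wav"
```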
Best for: CPU-only servers, edge devices, Apple Silicon Macs, low-memory environments, and applications that need real-time transcription with sub-second latency.
Vosk — Streaming-First, Ultra-Lightweight
Vosk is an offline speech recognition toolkit built on the Kaldi speech recognition framework. Unlike Whisper’s end-to-end neural approach, Vosk uses a traditional hybrid HMM-DNN architecture, which makes it dramatically smaller and faster for specific use cases.
| Metric | Value |
|---|---|
| GitHub Stars | 14,581 |
| Language | Python/Java/C#/Node (bindings) |
| Last Updated | February 2026 |
| Model Sizes | 40 MB (nano) to 2.1 GB (large) |
| Languages | 20+ languages with dedicated models |
| Hardware | CPU (works on Raspberry Pi Zero) |
| License | Apache 2.0 |
Vosk’s standout feature is real-time streaming recognition. It processes audio incrementally as it arrives, producing partial transcripts with low latency. This makes it ideal for live captioning, voice commands, and interactive voice response (IVR) systems.
The model sizes range from 40 MB (vosk-model-small) to 2.1 GB (vosk-model-en-us-0.22), with the small models running comfortably on a Raspberry Pi with 512 MB of RAM.
Vosk Docker Deployment
Vosk provides language-specific bindings but no official Docker Compose file. Here’s a production-ready server setup using the community alphacep/kaldi image:
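A sketch using the alphacep server images, which bake a language model into the container (one image per language):

```yaml
# docker-compose.yml
services:
  vosk:
    image: alphacep/kaldi-en:latest   # English; alphacep publishes per-language images
    ports:
      - "2700:2700"                   # WebSocket endpoint
    restart: unless-stopped
```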
Python client example:
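For in-process (non-server) use, the vosk Python package follows this pattern. The sketch assumes a 16 kHz mono PCM WAV and a Vosk model unpacked into a directory named `model` next to the script:

```python
import json
import wave

from vosk import Model, KaldiRecognizer  # pip install vosk

wf = wave.open("audio.wav", "rb")        # must be 16 kHz mono PCM
model = Model("model")                   # path to an unpacked Vosk model
rec = KaldiRecognizer(model, wf.getframerate())
rec.SetWords(True)                       # include per-word timestamps

while True:
    data = wf.readframes(4000)
    if not data:
        break
    if rec.AcceptWaveform(data):
        # A finalized segment is ready
        print(json.loads(rec.Result())["text"])

# Flush whatever is still buffered in the recognizer
print(json.loads(rec.FinalResult())["text"])
```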
WebSocket streaming for real-time transcription:
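Against a vosk-server container, streaming follows a simple protocol: send a config message, then raw PCM chunks, then an EOF marker. A sketch using the third-party websockets package:

```python
import asyncio
import json
import wave

import websockets  # pip install websockets

async def stream(path: str, uri: str = "ws://localhost:2700") -> None:
    wf = wave.open(path, "rb")  # 16 kHz mono PCM
    async with websockets.connect(uri) as ws:
        # Announce the sample rate before sending audio
        await ws.send(json.dumps({"config": {"sample_rate": wf.getframerate()}}))
        while True:
            chunk = wf.readframes(4000)
            if not chunk:
                break
            await ws.send(chunk)
            print(await ws.recv())       # partial or final JSON result
        await ws.send('{"eof" : 1}')     # ask the server to flush
        print(await ws.recv())           # final result

asyncio.run(stream("audio.wav"))
```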
Best for: Real-time streaming transcription, voice command interfaces, IVR systems, Raspberry Pi and IoT deployments, and applications that need low-latency partial results.
Head-to-Head Comparison
| Feature | OpenAI Whisper | whisper.cpp | Vosk |
|---|---|---|---|
| Architecture | Transformer seq2seq | GGML-optimized Transformer | Hybrid HMM-DNN |
| Languages | 99 | 99 | 20+ |
| Accuracy (EN) | ★★★★★ | ★★★★☆ | ★★★☆☆ |
| Model Size (large) | ~10 GB | ~3 GB (Q4) | ~2.1 GB |
| Smallest Model | 39 MB (tiny) | 39 MB (tiny Q4) | 40 MB (nano) |
| GPU Required | Recommended | Optional (CPU-first) | No |
| RAM (base model) | ~1 GB | ~300 MB | ~200 MB |
| Streaming | No (batch only) | Limited | Yes (native) |
| Real-time Factor (processing ÷ audio; lower is faster) | 0.1x–0.5x (GPU) | 0.3x–1x (CPU) | 0.1x–0.5x (CPU) |
| API Server | Community only | Built-in | Built-in (WebSocket) |
| Apple Silicon | MPS (good) | Native (excellent) | Yes |
| Punctuation | Automatic | Automatic | Model-dependent |
| Speaker Diarization | Via whisper-diarization | No | No |
| Best Latency | ~500 ms (GPU) | ~200 ms (CPU) | ~50 ms (CPU) |
Choosing the Right Engine
Use OpenAI Whisper when:
- Transcription accuracy is your top priority
- You have NVIDIA GPU hardware available
- Processing recorded audio in batch (not real-time)
- You need 99-language support with high quality
- You want automatic punctuation and capitalization
Use whisper.cpp when:
- You’re running on CPU-only hardware
- Memory is constrained (under 4 GB available)
- You need Apple Silicon optimization
- You want a single binary with no Python dependencies
- Edge deployment on resource-limited devices
Use Vosk when:
- Real-time streaming is required (live captioning, voice commands)
- You need sub-100 ms partial result latency
- Deploying on Raspberry Pi or microcontrollers
- Your language list is limited to 20 common languages
- You need WebSocket-based incremental results
Deployment Tips for Production
Reverse Proxy with TLS
Put any of these engines behind a reverse proxy such as nginx for HTTPS termination and rate limiting:
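A sketch for nginx; the hostname, certificate paths, and upstream port are placeholders, and the WebSocket upgrade headers matter only for Vosk streaming:

```nginx
limit_req_zone $binary_remote_addr zone=stt:10m rate=10r/m;

server {
    listen 443 ssl;
    server_name stt.example.com;

    ssl_certificate     /etc/letsencrypt/live/stt.example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/stt.example.com/privkey.pem;

    client_max_body_size 200m;        # allow large audio uploads

    location / {
        limit_req zone=stt burst=5 nodelay;
        proxy_pass http://127.0.0.1:8080;
        proxy_read_timeout 600s;      # long transcriptions take a while
        # WebSocket upgrade, needed for Vosk streaming
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }
}
```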
Audio Format Conversion
All three engines expect 16 kHz mono WAV for optimal results. Use ffmpeg to preprocess:
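For example, to resample an MP3 to 16 kHz mono 16-bit PCM:

```shell
ffmpeg -i input.mp3 -ar 16000 -ac 1 -c:a pcm_s16le output.wav
```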
Scaling with Multiple Workers
For high-throughput scenarios, run multiple containers behind a load balancer:
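One sketch: scale a whisper.cpp service with Compose replicas and round-robin across them with an nginx container. The image tag, server command, and the referenced nginx.conf are assumptions:

```yaml
# docker-compose.yml
services:
  stt-worker:
    image: ghcr.io/ggerganov/whisper.cpp:main
    command: ["whisper-server", "-m", "/models/ggml-base.en.bin",
              "--host", "0.0.0.0", "--port", "8080"]
    volumes:
      - ./models:/models:ro
    deploy:
      replicas: 4          # roughly one worker per few CPU cores

  lb:
    image: nginx:alpine
    ports:
      - "8080:8080"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro   # proxy_pass http://stt-worker:8080
    depends_on:
      - stt-worker
```

Docker’s embedded DNS returns an address for each stt-worker replica, so nginx distributes requests across the addresses it resolves at startup.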
FAQ
Which speech-to-text engine is most accurate?
OpenAI Whisper (large-v3 model) achieves the highest accuracy across all benchmark datasets, with word error rates below 3% on clean English speech. whisper.cpp matches Whisper’s accuracy since it runs the same model weights in a different runtime. Vosk trails behind at 5–8% WER on the same benchmarks but excels at real-time performance where the others can’t compete.
Can I run Whisper on a CPU without a GPU?
Yes, but performance is limited. Whisper’s base model runs at roughly one-tenth of real-time speed on a modern CPU (10 seconds of audio takes about 100 seconds to process). For CPU-only deployments, whisper.cpp is the better choice — its quantized base model runs at or near real-time speed on a 4-core CPU, making it practical for batch transcription.
How much disk space do the models require?
Whisper models range from 39 MB (tiny) to 10 GB (large-v3). whisper.cpp’s quantized models are smaller: the large-v3 Q4_0 fits in ~3 GB. Vosk’s largest English model is 2.1 GB, while the smallest (vosk-model-small-en-us) is just 40 MB. For most self-hosted setups, the base or small models provide the best accuracy-to-size tradeoff.
Do these engines support speaker diarization (identifying who said what)?
None of the three engines include built-in speaker diarization. For Whisper, you can use the community whisper-diarization project which combines Whisper transcription with pyannote.audio for speaker separation. whisper.cpp and Vosk require a separate diarization pipeline. If you need multi-speaker transcripts, plan to run a diarization model as a post-processing step.
What audio formats are supported?
Whisper and whisper.cpp accept any format that FFmpeg can decode (MP3, WAV, FLAC, OGG, M4A, etc.) since they call FFmpeg internally. Vosk requires 16 kHz mono PCM WAV input. If your source audio is in another format, convert it with ffmpeg -i input.mp3 -ar 16000 -ac 1 -c:a pcm_s16le output.wav before feeding it to Vosk.
Can these tools transcribe audio in multiple languages simultaneously?
Whisper and whisper.cpp support automatic language detection when you set --language auto — they’ll detect and transcribe in any of their 99 supported languages, including code-switched audio. Vosk requires you to load a language-specific model and cannot switch languages at runtime. If you need multilingual support with Vosk, you’d need to run separate model instances and route audio to the correct one.