## Why Self-Host Your AI?
- Privacy: Your data never leaves your server
- Cost: No per-token API fees
- Customization: Use any open model
- Reliability: Works offline, no rate limits
## The Self-Hosted AI Architecture
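The stack in this guide is assumed to be the common two-container layout: Ollama serving models on port 11434, with Open WebUI as the front end on port 3000. A minimal `docker-compose.yml` sketch (image tags, service names, and the volume name are illustrative, not the only valid choices):

```yaml
services:
  ollama:
    image: ollama/ollama
    volumes:
      - ollama:/root/.ollama     # persist downloaded model weights
    ports:
      - "11434:11434"            # Ollama HTTP API

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434   # point the UI at the Ollama service
    ports:
      - "3000:8080"              # Open WebUI listens on 8080 inside the container
    depends_on:
      - ollama

volumes:
  ollama:
```

The named volume matters: without it, every recreated container re-downloads all model weights.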
## Setup Steps
### 1. Start the Stack
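Assuming the stack is defined in a `docker-compose.yml` in the current directory, it starts in the background with:

```shell
# Start Ollama and Open WebUI detached
docker compose up -d

# Confirm both containers are up
docker compose ps
```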
### 2. Pull Models
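Models are pulled through the Ollama container. The container name `ollama` below is an assumption; it should match the service name in your compose file. The model tags are real Ollama library tags:

```shell
# Pull a small and a mid-size model from the Ollama library
docker exec -it ollama ollama pull llama3.2:3b
docker exec -it ollama ollama pull qwen2.5:14b

# List installed models and their sizes
docker exec -it ollama ollama list
```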
### 3. Access Web UI
Open http://localhost:3000 in a browser and sign up. The first account created becomes the administrator account.
## Recommended Models for 2026
### Chat & General Use
| Model | Download Size | Min. VRAM | Best For |
|---|---|---|---|
| Llama 3.2 3B | 2GB | 4GB | Quick tasks |
| Llama 3.1 8B | 5GB | 8GB | General chat |
| Qwen 2.5 14B | 9GB | 12GB | Reasoning |
| Mistral Small 22B | 12GB | 16GB | Complex tasks |
### Specialized
| Model | Purpose | Parameters |
|---|---|---|
| Qwen 2.5 Coder | Code generation | 7B-32B |
| DeepSeek Coder | Code completion | 6.7B |
| Nomic Embed | RAG/vector search | 137M |
| Whisper Large | Speech-to-text | 1.5B |
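As an illustration of the embedding entry above: Ollama exposes an HTTP API on port 11434, and assuming `nomic-embed-text` has already been pulled, an embedding request looks like this (the endpoint requires a running Ollama instance):

```shell
# Request an embedding vector from the local Ollama API
curl http://localhost:11434/api/embeddings \
  -d '{"model": "nomic-embed-text", "prompt": "self-hosted AI"}'
```

The response is a JSON object containing an `embedding` array, which you can store in any vector database for RAG.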
## Performance Tuning
### GPU Memory Management
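Ollama's memory behavior is controlled through environment variables on the server process (set them in the `environment:` section of the compose file, or export them before running `ollama serve`). The values below are illustrative starting points, not universal defaults:

```shell
# Tunables read by the Ollama server process (values are illustrative)
export OLLAMA_KEEP_ALIVE=10m          # keep a model loaded 10 min after the last request
export OLLAMA_MAX_LOADED_MODELS=1     # at most one model resident in VRAM at a time
export OLLAMA_NUM_PARALLEL=2          # parallel requests served per loaded model
```

Lowering `OLLAMA_KEEP_ALIVE` frees VRAM sooner at the cost of reload latency on the next request.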
### CPU-Only Mode
If you don’t have a GPU, remove the GPU section from docker-compose and use smaller models (3B-8B). Expect 5-15 tokens/second.
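For reference, the GPU section to delete is typically a `deploy` block on the Ollama service; in a compose file it usually looks like this (indentation assumes it sits under the service definition):

```yaml
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```

With that block removed, the containers run on any host without the NVIDIA container toolkit installed.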
## Frequently Asked Questions
Q: What GPU do I need for local AI?
A: Minimum: RTX 3060 12GB. Recommended: RTX 4070 12GB or RTX 4090 24GB for larger models.
Q: Can I run this on a Mac?
A: Yes! Ollama supports Apple Silicon via Metal. M1/M2/M3 chips run 8B models very well.
Q: How do I update models?
A: Run `ollama pull <model>` again. Ollama will download the updated weights.
Q: Is it safe to expose Open WebUI to the internet?
A: Only with authentication enabled and behind a reverse proxy with HTTPS. Never expose port 3000 directly.
Q: Can I use multiple GPUs?
A: Yes, Ollama supports multi-GPU. Set `CUDA_VISIBLE_DEVICES=0,1` before starting.
## Next Steps
- Set up reverse proxy with Caddy/Nginx for HTTPS
- Configure authentication in Open WebUI
- Add embedding models for RAG workflows
- Connect to local documents for personal AI assistant
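As a sketch of the first step, a minimal Caddyfile that puts automatic HTTPS in front of Open WebUI (the domain is a placeholder; Caddy obtains and renews certificates on its own, provided the domain resolves to your server):

```
ai.example.com {
    reverse_proxy localhost:3000
}
```

Combined with Open WebUI's built-in authentication, this satisfies the safety requirements from the FAQ above: TLS termination at the proxy and no direct exposure of port 3000.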