Self-Hosted AI Stack: Complete Local AI Setup Guide 2026

A guide to building a self-hosted AI stack with Ollama, Open WebUI, and embedding models, using a Docker Compose setup for local AI workflows.

Editorial Team

Why Self-Host Your AI?

  • Privacy: Your data never leaves your server
  • Cost: No per-token API fees
  • Customization: Use any open model
  • Reliability: Works offline, no rate limits

The Self-Hosted AI Architecture

```
User → Open WebUI → Ollama API → LLM (Llama/Mistral/Qwen)
                ↘ Embeddings → Vector DB → RAG
                ↘ TTS/STT → Voice Interface
```
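
To make the first hop in the diagram concrete, here is a minimal sketch of the kind of request a front end such as Open WebUI sends to the Ollama API on your behalf. The helper name `ollama_chat` and the `OLLAMA_URL` variable are illustrative, not part of Ollama. The sketch assumes the default port 11434 and a prompt without double quotes or backslashes, since it does no JSON escaping.

```shell
# Hypothetical helper: send one user message to Ollama's /api/chat endpoint.
# Assumes the prompt contains no double quotes or backslashes (no JSON escaping).
ollama_chat() {
  local model="$1"
  local prompt="$2"
  local base_url="${OLLAMA_URL:-http://localhost:11434}"
  curl -sS "$base_url/api/chat" \
    -H 'Content-Type: application/json' \
    -d "{\"model\": \"$model\", \"stream\": false, \"messages\": [{\"role\": \"user\", \"content\": \"$prompt\"}]}"
}

# Example (needs a running Ollama with the model already pulled):
#   ollama_chat llama3.2 "Why is the sky blue?"
```

With `"stream": false` the reply arrives as a single JSON object rather than a stream of chunks, which is easier to inspect by hand.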

Complete Docker Compose Stack

```yaml
# ai-stack.yml
services:
  # LLM Inference Engine
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    restart: unless-stopped
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

  # Web Interface
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    restart: unless-stopped
    ports:
      - "3000:8080"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
      - WEBUI_SECRET_KEY=your-secret-key  # placeholder: replace with a long random string
    volumes:
      - openwebui_data:/app/backend/data

  # Embedding Model
  embedding-model:
    image: ollama/ollama:latest
    container_name: ollama-embed
    restart: unless-stopped
    ports:
      - "11435:11434"
    volumes:
      - embed_data:/root/.ollama
    command: serve  # the image entrypoint is already /bin/ollama

  # Vector Database (Optional)
  qdrant:
    image: qdrant/qdrant:latest
    container_name: qdrant
    restart: unless-stopped
    ports:
      - "6333:6333"
    volumes:
      - qdrant_data:/qdrant/storage

volumes:
  ollama_data:
  openwebui_data:
  embed_data:
  qdrant_data:
```
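
The `WEBUI_SECRET_KEY` value in the file above is a placeholder. One way to produce a replacement is sketched below. `generate_secret_key` is a hypothetical helper name, and 32 random bytes is an arbitrary but reasonable length rather than a documented requirement.

```shell
# Hypothetical helper: print 32 random bytes as a 64-character hex string.
generate_secret_key() {
  head -c 32 /dev/urandom | od -An -tx1 | tr -d ' \n'
  echo
}

# Example:
#   generate_secret_key
```

Paste the output in place of `your-secret-key`, then run `docker compose -f ai-stack.yml config` to check that the file still parses.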

Setup Steps

1. Start the Stack

docker compose -f ai-stack.yml up -d
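
Containers can take a little while to start answering after `up -d` returns. The sketch below polls a URL until it responds. `wait_for_http` is a hypothetical helper, and the endpoints in the usage comments (`/api/version` for Ollama, `/healthz` for Qdrant) are assumptions to verify against the versions you run.

```shell
# Hypothetical helper: poll a URL once per second until it returns a 2xx
# status, or give up after the given number of attempts (default 30).
wait_for_http() {
  local url="$1"
  local attempts="${2:-30}"
  local i=0
  while [ "$i" -lt "$attempts" ]; do
    if curl -fs -o /dev/null --max-time 2 "$url"; then
      echo "ready: $url"
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  echo "timed out: $url" >&2
  return 1
}

# Example, using the ports published in ai-stack.yml:
#   wait_for_http http://localhost:11434/api/version   # main Ollama instance
#   wait_for_http http://localhost:11435/api/version   # embedding instance
#   wait_for_http http://localhost:6333/healthz        # Qdrant
#   wait_for_http http://localhost:3000                # Open WebUI
```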

2. Pull Models

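# Main chat model (the Ollama CLI runs inside the container)
docker exec ollama ollama pull llama3.2

# Coding assistant
docker exec ollama ollama pull qwen2.5-coder

# Embedding model, pulled into the second instance through its HTTP API
curl http://localhost:11435/api/pull -d '{"name": "nomic-embed-text"}'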
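
Once the pull finishes, a quick request confirms that the embedding instance responds. The sketch below calls Ollama's `/api/embed` endpoint on port 11435 from the compose file. `embed_text` and `EMBED_URL` are illustrative names, and the input is assumed to contain no characters that need JSON escaping.

```shell
# Hypothetical helper: request an embedding from the dedicated embedding instance.
# Assumes the text contains no double quotes or backslashes (no JSON escaping).
embed_text() {
  local text="$1"
  local base_url="${EMBED_URL:-http://localhost:11435}"
  curl -sS "$base_url/api/embed" \
    -H 'Content-Type: application/json' \
    -d "{\"model\": \"nomic-embed-text\", \"input\": \"$text\"}"
}

# Example (needs the embedding container running with the model pulled):
#   embed_text "self-hosted AI stack"
```

A successful response should contain an `embeddings` array of floating-point numbers. An error mentioning a missing model usually means the pull went to the other instance.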

3. Access Web UI

Open http://localhost:3000 and create your account.

Chat & General Use

| Model | Download Size | VRAM | Best For |
|---|---|---|---|
| Llama 3.2 3B | 2GB | 4GB | Quick tasks |
| Llama 3.1 8B | 5GB | 8GB | General chat |
| Qwen 2.5 14B | 9GB | 12GB | Reasoning |
| Mistral Small 22B | 13GB | 16GB | Complex tasks |

Specialized

| Model | Purpose | Parameters |
|---|---|---|
| Qwen 2.5 Coder | Code generation | 7B-32B |
| DeepSeek Coder | Code completion | 6.7B |
| Nomic Embed | RAG/Vector search | 137M |
| Whisper Large | Speech-to-text | 1.5B |

Performance Tuning

GPU Memory Management

# Check GPU usage
nvidia-smi

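# Limit context size to save VRAM. `ollama run` has no context-size flag;
# set num_ctx per request through the API, or type `/set parameter num_ctx 4096`
# inside an interactive `ollama run` session.
curl http://localhost:11434/api/generate -d '{"model": "llama3.2", "prompt": "Hello", "stream": false, "options": {"num_ctx": 4096}}'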
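
Context size matters because the key/value cache grows linearly with the number of tokens kept in context. The function below is a rough estimate only: it assumes an fp16 cache and ignores model weights and runtime overhead. `kv_cache_mib` is a hypothetical helper, and the example figures (28 layers, 8 KV heads, head dimension 128) are assumed values for Llama 3.2 3B, so check the model card for the model you actually run.

```shell
# Rough KV cache estimate in MiB:
#   2 (keys + values) * layers * kv_heads * head_dim * context_tokens * bytes_per_value
kv_cache_mib() {
  local layers="$1"
  local kv_heads="$2"
  local head_dim="$3"
  local ctx="$4"
  local bytes="${5:-2}"   # 2 bytes per value assumes an fp16 cache
  echo $(( 2 * layers * kv_heads * head_dim * ctx * bytes / 1024 / 1024 ))
}

# Example with the assumed Llama 3.2 3B shape:
#   kv_cache_mib 28 8 128 4096     # prints 448
#   kv_cache_mib 28 8 128 131072   # prints 14336
```

Under these assumptions, dropping from a 131072-token context to 4096 tokens cuts the cache from about 14 GiB to under 0.5 GiB.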

CPU-Only Mode

If you don’t have a GPU, remove the deploy block (the GPU reservation) from the ollama service in ai-stack.yml and use smaller models (3B-8B). Expect roughly 5-15 tokens/second, depending on your CPU and the model.
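
To see where your hardware falls in that range, you can compute throughput from the `eval_count` and `eval_duration` fields that Ollama includes in a non-streaming `/api/generate` response (the duration is reported in nanoseconds). `tokens_per_second` is a hypothetical helper name.

```shell
# Hypothetical helper: convert Ollama's eval_count (tokens generated) and
# eval_duration (nanoseconds) into tokens per second.
tokens_per_second() {
  local eval_count="$1"
  local eval_duration_ns="$2"
  awk -v n="$eval_count" -v ns="$eval_duration_ns" \
    'BEGIN { printf "%.1f\n", n / (ns / 1e9) }'
}

# Example: 120 tokens generated in 10 seconds
#   tokens_per_second 120 10000000000   # prints 12.0
```

Running `ollama run <model> --verbose` prints similar timing statistics after each reply if you prefer not to call the API directly.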

Frequently Asked Questions

Q: What GPU do I need for local AI?

A: Minimum: RTX 3060 12GB. Recommended: RTX 4070 12GB or RTX 4090 24GB for larger models.

Q: Can I run this on a Mac?

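A: Yes. Ollama supports Apple Silicon via Metal, and M1/M2/M3 chips run 8B models well. Docker Desktop on macOS does not pass the GPU through to containers, so run Ollama natively on the host and point Open WebUI at it, for example with OLLAMA_BASE_URL=http://host.docker.internal:11434.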

Q: How do I update models?

A: Run ollama pull <model> again (docker exec ollama ollama pull <model> in this setup). Ollama downloads only the layers that have changed.

Q: Is it safe to expose Open WebUI to the internet?

A: Only with authentication enabled and behind a reverse proxy with HTTPS. Never expose port 3000 directly.
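
One way to check that the container port is not reachable from other machines is to inspect the published bindings. The sketch below parses the output of `docker port`. `loopback_only` is a hypothetical helper, and it assumes each output line ends in `host:port`. Publishing the port as `127.0.0.1:3000:8080` in the compose file is what would make the check pass.

```shell
# Hypothetical helper: read `docker port <container>` output on stdin and
# return non-zero if any mapping listens on a non-loopback address.
loopback_only() {
  awk '
    { host = $NF; sub(/:[0-9]+$/, "", host) }
    host != "127.0.0.1" && host != "[::1]" { bad = 1 }
    END { exit bad + 0 }
  '
}

# Example:
#   docker port open-webui | loopback_only || echo "port is exposed beyond localhost"
```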

Q: Can I use multiple GPUs?

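A: Yes, Ollama supports multi-GPU and honors CUDA_VISIBLE_DEVICES (for example CUDA_VISIBLE_DEVICES=0,1). With the compose file above, also change count: 1 to count: all in the deploy block so the container can see every GPU.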


Next Steps

  1. Set up a reverse proxy with Caddy or Nginx for HTTPS
  2. Configure authentication in Open WebUI
  3. Add embedding models for RAG workflows
  4. Connect local documents to build a personal AI assistant
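
As a starting point for the RAG steps, the sketch below creates a Qdrant collection to hold document embeddings. `qdrant_create_collection` and `QDRANT_URL` are illustrative names. The default vector size of 768 is an assumption about the output dimension of nomic-embed-text, so confirm it against the length of an embedding returned by your own instance before indexing anything.

```shell
# Hypothetical helper: create a Qdrant collection for embedding vectors.
qdrant_create_collection() {
  local name="$1"
  local size="${2:-768}"   # assumed dimension of nomic-embed-text vectors
  local base_url="${QDRANT_URL:-http://localhost:6333}"
  curl -sS -X PUT "$base_url/collections/$name" \
    -H 'Content-Type: application/json' \
    -d "{\"vectors\": {\"size\": $size, \"distance\": \"Cosine\"}}"
}

# Example (needs the qdrant container running):
#   qdrant_create_collection docs
```

Cosine distance is a common choice for text embeddings. If the vector size does not match the embedding model's output, inserts into the collection will be rejected.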