← Back to posts
comparison guide self-hosted · · 11 min read

DuckDB vs Apache Doris vs Apache Pinot: Best Self-Hosted OLAP Databases 2026

Compare DuckDB, Apache Doris, and Apache Pinot for self-hosted OLAP and analytical data warehousing. Includes Docker deployment guides, performance benchmarks, and use case recommendations.

OS
Editorial Team

When you need fast analytical queries over large datasets, choosing the right OLAP (Online Analytical Processing) database is critical. In 2026, three open-source options stand out for different use cases: DuckDB for embedded, single-node analytics; Apache Doris for high-performance distributed data warehousing; and Apache Pinot for real-time analytics at scale.

This guide compares all three, provides Docker deployment configurations, and helps you pick the right tool for your workload.

FeatureDuckDBApache DorisApache Pinot
GitHub Stars37,71515,2626,069
LanguageC++JavaJava
ArchitectureEmbedded, single-processDistributed MPPDistributed, real-time
DeploymentLibrary / Docker containerDocker Compose, KubernetesDocker Compose, Kubernetes
Query LanguageSQL (PostgreSQL dialect)SQL (MySQL dialect)SQL (PQL)
StorageColumnar, on-disk or in-memoryColumnar, distributed storageColumnar, segment-based
Real-time ingestionNo (batch-oriented)Yes (stream + batch)Yes (optimized for streaming)
Min RAM~512 MB~8 GB (FE + BE)~4 GB (full stack)
Max dataset sizeLimited by single-node RAM/diskPetabytes (distributed)Petabytes (distributed)
Best forData science, ETL, local analyticsEnterprise data warehouse, BIReal-time dashboards, user-facing analytics
LicenseMITApache 2.0Apache 2.0
Last UpdatedApril 2026April 2026April 2026

Why Self-Host Your OLAP Database?

Cloud data warehouses like Snowflake, BigQuery, and Redshift are convenient but come with significant costs at scale. Self-hosting gives you:

  • Cost control: No per-query or per-TB pricing. Your infrastructure cost is fixed.
  • Data sovereignty: Sensitive data stays on your servers, never leaves your network.
  • No vendor lock-in: Open-source SQL engines mean you can migrate or scale freely.
  • Custom performance tuning: Adjust memory, storage, and query settings to your exact workload.
  • Offline operation: Run analytics in air-gapped environments or on edge devices.

For organizations processing terabytes of data daily, self-hosted OLAP engines can reduce costs by 60-80% compared to cloud equivalents while maintaining sub-second query performance.

DuckDB: Embedded OLAP for Single-Node Analytics

DuckDB is an in-process analytical SQL database. Unlike traditional databases that run as a server, DuckDB runs inside your application process — similar to SQLite but optimized for analytical queries instead of transactional ones.

When to Use DuckDB

  • Data science workflows: Fast CSV/Parquet analysis in Python, R, or Jupyter notebooks
  • ETL pipeline processing: Transform and aggregate data before loading into a larger warehouse
  • Local development: Test analytical queries against production-like data without a server
  • Edge analytics: Run OLAP queries on laptops, IoT devices, or resource-constrained environments
  • Single-user dashboards: Power small-scale dashboards where concurrency is low

DuckDB Installation

Install via pip for Python:

1
pip install duckdb

Or use the CLI tool:

1
2
3
4
# Linux/macOS
curl -L https://github.com/duckdb/duckdb/releases/latest/download/duckdb_cli-linux-amd64.zip -o duckdb.zip
unzip duckdb.zip
sudo mv duckdb /usr/local/bin/

DuckDB Docker Deployment

While DuckDB is designed to run as a library, you can containerize it for server-based access:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
version: "3.8"
services:
  duckdb-server:
    image: chdb/duckdb-server:latest
    ports:
      - "8000:8000"
    volumes:
      - ./data:/data
      - ./scripts:/scripts
    environment:
      - DUCKDB_EXTENSION_PATH=/data
    restart: unless-stopped

DuckDB Quick Start

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
-- Query a Parquet file directly
SELECT * FROM read_parquet('data/sales/*.parquet')
WHERE sale_date >= '2026-01-01'
GROUP BY region, product_category
ORDER BY total_revenue DESC;

-- Create a persistent database and load data
CREATE TABLE events AS
SELECT * FROM read_csv_auto('events.csv');

-- Complex analytical query with window functions
SELECT
    user_id,
    session_id,
    event_time,
    LAG(event_time) OVER (PARTITION BY user_id ORDER BY event_time) AS prev_event,
    event_time - LAG(event_time) OVER (PARTITION BY user_id ORDER BY event_time) AS time_since_last
FROM events;

DuckDB supports reading Parquet, CSV, JSON, and SQLite files directly — no data import required. Its vectorized query engine delivers exceptional single-node performance, often outperforming Spark for datasets under 100 GB.

Apache Doris: High-Performance Distributed Data Warehouse

Apache Doris is a massively parallel processing (MPP) analytical database designed for real-time data warehousing. It provides a MySQL-compatible SQL interface and supports both batch and streaming data ingestion.

When to Use Apache Doris

  • Enterprise data warehousing: Replace expensive cloud warehouses with a self-hosted alternative
  • Real-time BI dashboards: Sub-second query response for tools like Superset or Metabase
  • Log analytics: Ingest and analyze logs in near real-time with automatic rollups
  • Multi-tenant analytics: Serve queries from multiple teams with resource isolation
  • Ad-hoc reporting: Analysts running complex joins across billions of rows

Apache Doris Architecture

Doris uses a two-node architecture:

  • Frontend (FE): Handles query parsing, planning, and metadata management
  • Backend (BE): Stores data and executes query fragments

This separation allows independent scaling of compute and storage.

Apache Doris Docker Compose

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
version: "3.8"
services:
  doris-fe:
    image: apache/doris:2.1-fe-x86_64
    container_name: doris-fe
    hostname: doris-fe
    ports:
      - "8030:8030"
      - "9030:9030"
    environment:
      - FE_SERVERS=fe1:172.20.80.2:9010
      - FE_ID=1
    volumes:
      - doris-fe-meta:/opt/apache-doris/fe/doris-meta
    restart: unless-stopped

  doris-be:
    image: apache/doris:2.1-be-x86_64
    container_name: doris-be
    hostname: doris-be
    ports:
      - "8040:8040"
      - "9050:9050"
    environment:
      - FE_SERVERS=fe1:172.20.80.2:9010
      - BE_ADDR=172.20.80.3:9050
    volumes:
      - doris-be-storage:/opt/apache-doris/be/storage
    depends_on:
      - doris-fe
    restart: unless-stopped

volumes:
  doris-fe-meta:
  doris-be-storage:

Start the cluster:

1
2
3
4
5
docker compose up -d

# Wait 30 seconds for FE to initialize, then register BE
docker exec -it doris-fe mysql -h 127.0.0.1 -P 9030 -u root -e \
  "ALTER SYSTEM ADD BACKEND 'doris-be:9050';"

Apache Doris SQL Example

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
-- Create a table with automatic partitioning
CREATE TABLE sales_orders (
    order_id BIGINT,
    customer_id INT,
    product_name VARCHAR(255),
    amount DECIMAL(10,2),
    order_date DATE,
    status VARCHAR(32)
)
DUPLICATE KEY(order_id)
PARTITION BY RANGE(order_date) (
    PARTITION p2025q1 VALUES LESS THAN ('2025-04-01'),
    PARTITION p2025q2 VALUES LESS THAN ('2025-07-01'),
    PARTITION p2025q3 VALUES LESS THAN ('2025-10-01'),
    PARTITION p2025q4 VALUES LESS THAN ('2026-01-01'),
    PARTITION p2026q1 VALUES LESS THAN ('2026-04-01')
)
DISTRIBUTED BY HASH(order_id) BUCKETS 8;

-- Stream data insertion (real-time)
INSERT INTO sales_orders VALUES
    (1001, 42, 'Widget A', 29.99, '2026-03-15', 'completed'),
    (1002, 17, 'Widget B', 49.99, '2026-03-16', 'pending');

-- Query with materialized view for fast aggregation
CREATE MATERIALIZED VIEW daily_revenue AS
SELECT order_date, SUM(amount) AS total, COUNT(*) AS orders
FROM sales_orders
GROUP BY order_date;

Doris achieves impressive performance through columnar storage, vectorized execution, and automatic materialized view maintenance. It handles billions of rows with sub-second latency for typical BI queries.

Apache Pinot: Real-Time Analytics at Scale

Apache Pinot is a distributed OLAP database built for real-time analytics. Originally developed at LinkedIn for powering user-facing analytics features, Pinot excels at low-latency queries on streaming data.

When to Use Apache Pinot

  • User-facing analytics: Embed real-time charts and metrics directly in your application
  • Clickstream analysis: Process and query millions of events per second
  • Time-series dashboards: Monitor IoT, infrastructure, or business metrics in real time
  • A/B testing platforms: Segment and analyze experiment results instantly
  • Anomaly detection: Run rolling aggregations and statistical queries on live data

Apache Pinot Architecture

Pinot uses a multi-component architecture:

  • ZooKeeper: Cluster coordination and metadata
  • Controller: Manages segment lifecycle, data ingestion, and table configuration
  • Broker: Receives queries and routes them to servers
  • Server: Stores data segments and executes query fragments

Apache Pinot Docker Compose

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
version: "3.8"
services:
  zookeeper:
    image: zookeeper:3.9
    container_name: pinot-zk
    ports:
      - "2181:2181"
    restart: unless-stopped

  pinot-controller:
    image: apachepinot/pinot:latest
    container_name: pinot-controller
    command: StartController -zkAddress pinot-zk:2181
    ports:
      - "9000:9000"
    environment:
      - JAVA_OPTS=-Xms256m -Xmx512m
    depends_on:
      - zookeeper
    restart: unless-stopped

  pinot-broker:
    image: apachepinot/pinot:latest
    container_name: pinot-broker
    command: StartBroker -zkAddress pinot-zk:2181
    ports:
      - "8099:8099"
    environment:
      - JAVA_OPTS=-Xms256m -Xmx512m
    depends_on:
      - pinot-controller
    restart: unless-stopped

  pinot-server:
    image: apachepinot/pinot:latest
    container_name: pinot-server
    command: StartServer -zkAddress pinot-zk:2181
    ports:
      - "8098:8098"
    environment:
      - JAVA_OPTS=-Xms512m -Xmx1g
    depends_on:
      - pinot-broker
    restart: unless-stopped

Start the full stack:

1
2
3
4
docker compose up -d

# Access the Pinot UI at http://localhost:9000
# Create a table and start ingesting data

Apache Pinot Table Configuration

Create a real-time table via the REST API:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
# Create a schema
cat > events_schema.json << 'SCHEMA'
{
  "schemaName": "events",
  "dimensionFieldSpecs": [
    {"name": "userId", "dataType": "STRING"},
    {"name": "eventType", "dataType": "STRING"},
    {"name": "page", "dataType": "STRING"}
  ],
  "metricFieldSpecs": [
    {"name": "duration", "dataType": "INT"},
    {"name": "revenue", "dataType": "DOUBLE"}
  ],
  "dateTimeFieldSpecs": [
    {"name": "timestamp", "dataType": "TIMESTAMP", "format": "1:MILLISECONDS:EPOCH", "granularity": "1:MILLISECONDS"}
  ]
}
SCHEMA

curl -X POST -F "schema=@events_schema.json" \
  http://localhost:9000/tables/schema

# Query with PQL (Pinot Query Language)
curl -X POST -H "Content-Type: application/json" \
  http://localhost:8099/query/sql \
  -d '{"sql": "SELECT eventType, COUNT(*) FROM events WHERE timestamp > ago(1h) GROUP BY eventType"}'

Pinot’s strength is sub-second query latency on continuously updating data. Unlike Doris which optimizes for batch + streaming, Pinot is purpose-built for real-time ingestion and immediate query availability.

Performance Comparison

BenchmarkDuckDBApache DorisApache Pinot
1 GB scan~50ms~120ms~200ms
100 GB aggregation~2s~800ms~1.5s
1 TB joinN/A (single-node)~5s~8s
Real-time latencyN/A~100ms~10ms
Concurrent queries1-4100+1,000+
Ingestion rate~500K rows/s~2M rows/s~10M rows/s

Note: Results vary based on hardware, configuration, and query complexity. These are approximate benchmarks from common workloads.

Choosing the Right OLAP Database

Choose DuckDB if:

  • You need fast analytical queries on a single machine
  • Your data fits in RAM or on a single disk
  • You are doing data science, ETL, or local analysis
  • You want zero operational overhead (no server to manage)
  • Budget is limited and you want to use existing hardware

Choose Apache Doris if:

  • You need a full-featured data warehouse for your organization
  • You want MySQL-compatible SQL for easy migration
  • You need both real-time and batch ingestion
  • Your team is familiar with MySQL tooling
  • You need materialized views and automatic rollups

Choose Apache Pinot if:

  • You need sub-second queries on continuously updating data
  • You are building user-facing analytics features
  • You process millions of events per second
  • You need high concurrency (1,000+ queries/second)
  • You want built-in support for streaming ingestion from Kafka

Integration Ecosystem

All three databases integrate with the broader data stack:

IntegrationDuckDBApache DorisApache Pinot
BI ToolsSuperset, MetabaseSuperset, Metabase, TableauSuperset, Metabase
PythonNative (pip install)JDBC driverPython client
KafkaVia Python pipelineNative connectorNative connector
ParquetNative read/writeNative read/writeSegment import
S3/MinIOExtension (httpfs)Native connectorVia ingestion job
dbtdbt-duckdb adapterdbt-doris adapterLimited support

For data pipelines that feed these databases, see our data pipeline orchestration guide and Airbyte vs Meltano comparison.

FAQ

Is DuckDB suitable for production server deployments?

DuckDB is designed primarily as an embedded database, similar to SQLite. For production server use, it works best as part of a larger data pipeline — processing data before loading it into a distributed warehouse like Doris or Pinot. Projects like DuckDB Server and chDB provide HTTP APIs for server-like access, but they lack the concurrency and multi-user features of dedicated analytical databases.

How does Apache Doris compare to ClickHouse?

Both are high-performance columnar databases, but Doris offers better MySQL compatibility and a simpler operational model. Doris separates compute (FE) from storage (BE), making it easier to scale. ClickHouse typically delivers raw query performance advantages but has a steeper learning curve. For teams already using MySQL tooling, Doris is often the easier transition. See our ClickHouse vs Druid vs Pinot comparison for more alternatives.

Can Apache Pinot handle historical data, or is it real-time only?

Pinot handles both real-time and offline (historical) data. You can ingest historical data as batch segments and real-time data through Kafka streams. The query engine transparently combines results from both data sources, so your queries see a unified view.

What is the minimum hardware required to run Apache Doris?

For development and testing, Apache Doris can run on a single machine with 8 GB RAM. The Frontend needs approximately 2 GB and the Backend needs 4 GB minimum. For production workloads with billions of rows, plan for 32+ GB RAM and SSD storage. Multiple Backend nodes should be deployed for high availability.

How do I migrate from a cloud data warehouse to a self-hosted OLAP database?

The migration path depends on your source:

  1. Export data from Snowflake or BigQuery as Parquet files
  2. Load into DuckDB for initial analysis and schema design
  3. For distributed scale, use Doris S3 reader or Pinot segment builder to import Parquet
  4. Rebuild materialized views and indexes in the target system
  5. Update BI tool connections to point to the new database

Does DuckDB support concurrent writes?

No. DuckDB uses a single-writer model — only one process can write to a database file at a time. Multiple readers can access the same file concurrently. For multi-writer scenarios, use Apache Doris or Apache Pinot which support concurrent writes across distributed nodes.

Advertise here
Advertise here