Self-Hosted Columnar Data Format Libraries: Apache Parquet vs Arrow vs ORC — Choosing the Right Big Data Storage Layer

Introduction

When working with analytical workloads, data warehouses, or large-scale data processing pipelines, the choice of storage format has an outsized impact on query performance, storage costs, and interoperability. Columnar formats — where data is stored column-by-column rather than row-by-row — enable efficient compression, predicate pushdown, and vectorized processing that row-based formats cannot match.

In this article, we compare the three major open-source columnar storage format libraries: Apache Parquet (3,057 ⭐ for the Java implementation), Apache Arrow (16,858 ⭐), and Apache ORC (766 ⭐). All three are Apache Software Foundation projects with broad industry adoption.

Feature	Apache Parquet	Apache Arrow	Apache ORC
GitHub Stars	3,057 (parquet-java)	16,858	766
Primary Use Case	Compressed on-disk storage	In-memory columnar format	Hive-optimized storage
Compression	Excellent (snappy, gzip, zstd, lz4)	Light (dictionary, delta)	Excellent (ZLIB, snappy, LZ4, ZSTD)
Predicate Pushdown	Full (min/max stats, bloom filters)	Column-level (via compute)	Full (min/max/sum stats, bloom)
Schema Evolution	Full support	Full support	Full support
Language Support	Java, C++, Python, Go, Rust	12+ languages	Java, C++, Python
Ecosystem	Spark, Hive, Presto, Flink, DuckDB	Pandas, Polars, DuckDB, DataFusion	Hive, Presto, Spark
Best For	Long-term storage & query engines	In-memory analytics & interchange	Hive data warehouses
License	Apache 2.0	Apache 2.0	Apache 2.0

Apache Parquet: The Universal On-Disk Format

Apache Parquet was created through a collaboration between Twitter and Cloudera in 2013, designed as a self-describing, language-agnostic columnar storage format. It has become the dominant on-disk format for big data, supported by virtually every query engine and data processing framework.

Key Features

Columnar Compression: Snappy, GZIP, ZSTD, LZ4, and Brotli with per-column encoding selection
Predicate Pushdown: Row group-level statistics (min/max/null count) skip irrelevant data
Bloom Filters: Optional bloom filter indexes for even faster point-lookup elimination
Nested Data Support: Efficient encoding of complex nested structures (arrays, maps, structs)
Page-Level Encoding: Dictionary, RLE, delta, and bit-packing encodings per column chunk

Installation

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
# Python
pip install pyarrow  # Uses Apache Arrow's Parquet implementation

# Rust
cargo add parquet

# Java (Maven)
# <dependency>
#     <groupId>org.apache.parquet</groupId>
#     <artifactId>parquet-hadoop</artifactId>
#     <version>1.14.1</version>
# </dependency>

Basic Usage (Python)

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
import pyarrow as pa
import pyarrow.parquet as pq

# Write Parquet with compression
table = pa.table({
    'id': range(1000000),
    'value': [x * 1.5 for x in range(1000000)],
    'category': ['A', 'B', 'C', 'D'] * 250000
})

pq.write_table(table, 'data.parquet',
    compression='zstd',
    compression_level=3,
    row_group_size=100000)

# Read with predicate pushdown (reads only matching row groups)
filtered = pq.read_table('data.parquet',
    filters=[('category', '=', 'A')])
print(f"Rows: {len(filtered)}")  # 250,000

When to Choose Parquet

Parquet is the optimal choice for long-term data storage where compression ratio and query-time I/O reduction are paramount. Its deep integration with Spark, Hive, DuckDB, and Presto makes it the default format for data lake architectures. Choose Parquet when you need maximum storage efficiency with broad ecosystem compatibility.

Apache Arrow: The In-Memory Columnar Standard

Apache Arrow, initiated by Wes McKinney (creator of Pandas) in 2016, defines a language-agnostic columnar memory format optimized for analytical processing. Unlike Parquet and ORC which target on-disk storage, Arrow is designed for in-memory data representation and zero-copy interchange between systems. With 16,858 GitHub stars, it has the largest community among columnar technologies.

Key Features

Zero-Copy Interchange: Share data between Python, R, Java, C++, and Rust without serialization
Vectorized Execution: SIMD-friendly columnar layout enables cache-efficient processing
Gandiva Expression Compiler: LLVM-based expression evaluation for SQL-like filtering
Flight RPC: gRPC-based protocol for high-throughput data transfer (10-100x faster than JDBC/ODBC)
Compute Functions: Built-in kernels for filtering, sorting, grouping, and aggregation

Installation

1
2
3
4
5
6
7
8
9
# Python
pip install pyarrow

# C++ (CMake)
# find_package(Arrow REQUIRED)
# target_link_libraries(my_app PRIVATE arrow_shared)

# Rust
cargo add arrow

Basic Usage (Python)

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
import pyarrow as pa
import pyarrow.compute as pc

# Create Arrow table (columnar in memory)
arrays = [
    pa.array([1, 2, 3, 4, 5], type=pa.int64()),
    pa.array([10.5, 20.3, 30.1, 40.7, 50.2], type=pa.float64()),
    pa.array(['a', 'b', 'c', 'd', 'e'])
]
table = pa.Table.from_arrays(arrays, names=['id', 'value', 'label'])

# Vectorized filtering (no Python loop overhead)
mask = pc.greater(table.column('value'), 25.0)
filtered = table.filter(mask)
print(f"Filtered rows: {len(filtered)}")  # 3

# Zero-copy conversion to Pandas
df = filtered.to_pandas()

Apache Arrow Flight

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
# Server-side
import pyarrow.flight as flight

class MyFlightServer(flight.FlightServerBase):
    def do_get(self, context, ticket):
        return flight.RecordBatchStream(table)

# Client reads with zero-copy
client = flight.FlightClient('grpc://server:8815')
reader = client.do_get(flight.Ticket(b'my_data'))
table = reader.read_all()

When to Choose Arrow

Arrow is the best choice when you need high-performance in-memory analytics or data interchange between heterogeneous systems. Use Arrow when building real-time analytical applications, data science pipelines that cross language boundaries, or services that need to serve large result sets with minimal serialization overhead. Arrow is also the backbone of modern dataframe libraries like Polars and DuckDB.

Apache ORC: The Hive-Optimized Columnar Format

Apache ORC (Optimized Row Columnar) was created by Hortonworks (now part of Cloudera) in 2013 specifically for the Hive ecosystem. With 766 GitHub stars, it has a smaller community than Parquet and Arrow but remains deeply optimized for Hive-based data warehouses with unique features not available elsewhere.

Key Features

Stripe-Level Indexes: Min, max, sum, and count statistics per 10,000-row stripe
Bloom Filters: Built-in bloom filter indexes at the stripe level
ACID Transactions: Full ACID transaction support in Hive (INSERT, UPDATE, DELETE, MERGE)
Lightweight Compression: Run-length encoding optimized for sorted repetitive data
Type Evolution: Schema evolution with backward compatibility

Installation

1
2
3
4
5
6
7
8
9
# Python (via PyORC)
pip install pyorc

# Java (Maven)
# <dependency>
#     <groupId>org.apache.orc</groupId>
#     <artifactId>orc-core</artifactId>
#     <version>1.9.3</version>
# </dependency>

Basic Usage (Python with PyORC)

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
import pyorc

schema = "struct<id:int,name:string,score:double>"

with open('data.orc', 'wb') as f:
    with pyorc.Writer(f, schema, 
        compression=pyorc.CompressionKind.ZSTD,
        bloom_filter_columns=['id']) as writer:
        writer.write((1, 'Alice', 95.5))
        writer.write((2, 'Bob', 88.3))
        writer.write((3, 'Charlie', 92.1))

# Read with predicate pushdown
with open('data.orc', 'rb') as f:
    reader = pyorc.Reader(f)
    for row in reader:
        if row.score > 90:
            print(f"{row.name}: {row.score}")

When to Choose ORC

ORC is the best choice when your infrastructure is built around Apache Hive and you need ACID transaction support on columnar storage. Its stripe-level statistics and bloom filter indexes provide excellent query performance for Hive workloads. However, for multi-engine environments (Spark + Presto + DuckDB), Parquet offers broader compatibility.

Compression and Performance Tradeoffs

All three formats use columnar compression, but their approaches and efficiencies differ:

Compression Aspect	Parquet	Arrow	ORC
Default Algorithm	Snappy	Dictionary	ZLIB
Compression Ratio	5-10x	2-4x	6-12x
Decompression Speed	Fast	Very Fast (direct)	Moderate
Column-Level Config	Yes	Per-buffer	Yes
Encoding Variants	Dictionary, RLE, Delta	Dictionary, RLE	RLE v1/v2, Dictionary

Parquet with ZSTD compression level 3 offers the best balance of compression ratio and decompression speed for general-purpose data lakes. ORC with ZLIB achieves the highest compression ratios but at the cost of slower reads — useful for cold archival data that’s queried infrequently.

Why Self-Host Your Columnar Data Pipeline?

Running your own columnar storage infrastructure gives you full control over data locality, access patterns, and compression strategies. Unlike managed cloud warehouses, self-hosted Parquet/Arrow/ORC stacks let you optimize for your specific query patterns without per-query pricing penalties. A well-configured local Parquet lake can serve analytical queries at a fraction of cloud warehouse costs.

For OLAP query engines that consume these formats, see our DuckDB vs Apache Doris vs Apache Pinot comparison. If you’re storing time-series data that benefits from columnar compression, our M3DB vs QuestDB vs TimescaleDB guide covers specialized time-series columnar stores. ’''

Building a Multi-Format Data Pipeline

In production environments, the best architecture often uses all three formats at different pipeline stages. Here’s a battle-tested pattern used by data engineering teams processing terabytes daily.

Ingestion Layer (ORC → Parquet): If your data originates in Hive, ingest into ORC tables for ACID transaction support and incremental updates. Periodically compact ORC partitions into Parquet files for broader query engine compatibility. Tools like Apache Hudi and Delta Lake automate this compaction while maintaining transactional guarantees.

Processing Layer (Arrow): Load Parquet files into Arrow for in-memory vectorized processing. DuckDB and Polars both use Arrow as their internal representation, enabling zero-copy querying across Parquet datasets. For streaming use cases, Arrow Flight transmits data between services with gRPC-native efficiency — eliminating the serialization bottleneck of REST/JSON APIs.

Serving Layer (Parquet ± Arrow Flight): Store final analytical datasets as Parquet with ZSTD compression (best size/speed balance) and row group sizes tuned for your query patterns. For interactive dashboards, serve via Arrow Flight to enable sub-second query responses from BI tools. For batch reporting, direct Parquet reads from Spark or Trino suffice.

1
2
3
4
5
6
7
8
9
-- DuckDB: seamless multi-format querying
SELECT 
    category,
    COUNT(*) as count,
    AVG(score) as avg_score
FROM read_parquet('analytics/*.parquet')
WHERE date >= '2026-01-01'
GROUP BY category
ORDER BY avg_score DESC;

This pipeline design gives you Hive’s transactional ingestion, Arrow’s processing speed, and Parquet’s storage efficiency — each format doing what it does best. ’''

FAQ

Should I use Parquet or Arrow for my data lake?

Use both — they’re complementary. Store data in Parquet on disk (compressed, predicate-pushdown optimized) and use Arrow in memory for processing (zero-copy, vectorized). Tools like PyArrow handle the Parquet ↔ Arrow conversion seamlessly, giving you the best of both worlds.

Does Arrow replace Parquet?

No. Arrow is an in-memory format designed for processing speed and zero-copy interchange. Parquet is an on-disk format designed for storage efficiency and query optimization. Arrow can read/write Parquet files (via pyarrow.parquet), and many query engines use Arrow in-memory while storing Parquet on disk.

Can ORC files be used outside of Hive?

Yes, but with caveats. ORC readers exist for Python (pyorc), Presto, Spark, and Flink. However, the development velocity and cross-engine testing for ORC is lower than Parquet. For greenfield projects that aren’t Hive-centric, Parquet is generally the more portable choice.

How do columnar formats compare to row-based formats like Avro?

Columnar formats (Parquet, ORC) excel at analytical queries that read a subset of columns (e.g., SELECT AVG(price) FROM sales). Row-based formats (Avro, JSON Lines) are better for transactional workloads that read entire records. Use columnar for analytics, row-based for event streaming and message queues.

What about Apache Iceberg and Delta Lake?

Iceberg and Delta Lake are table formats (metadata layers) that sit on top of storage formats like Parquet and ORC. They add ACID transactions, time travel, and schema evolution at the table level. Both use Parquet as their default underlying storage format. Choose a table format for production data lakes requiring transactional guarantees.

How do I migrate existing CSV/JSON data to Parquet?

The simplest approach is using DuckDB or PyArrow:

1
2
-- DuckDB: one-liner conversion
COPY (SELECT * FROM read_csv_auto('data.csv')) TO 'data.parquet' (FORMAT PARQUET, COMPRESSION ZSTD);

For large-scale migrations, Apache Spark or PyArrow with row group batching provides optimal throughput.

💰 想测试你的市场判断力？我用 Polymarket 做预测市场交易——这是全球最大的预测市场平台，从大选结果到技术监管时间线，什么都可以押注。和赌博不同，这是真正的信息市场：你懂的信息越多，胜率越高。我靠预测技术相关事件的走向已经赚了不少。用我的邀请链接注册：Polymarket.com

Introduction

Apache Parquet: The Universal On-Disk Format

Key Features

Installation

Basic Usage (Python)

When to Choose Parquet

Apache Arrow: The In-Memory Columnar Standard

Key Features

Installation

Basic Usage (Python)

Apache Arrow Flight

When to Choose Arrow

Apache ORC: The Hive-Optimized Columnar Format

Key Features

Installation

Basic Usage (Python with PyORC)

When to Choose ORC

Compression and Performance Tradeoffs

Why Self-Host Your Columnar Data Pipeline?

Building a Multi-Format Data Pipeline

FAQ

Should I use Parquet or Arrow for my data lake?

Does Arrow replace Parquet?

Can ORC files be used outside of Hive?

How do columnar formats compare to row-based formats like Avro?

What about Apache Iceberg and Delta Lake?

How do I migrate existing CSV/JSON data to Parquet?

Related Posts

Self-Hosted DataFrame Processing Libraries: Polars vs Vaex vs datatable

Self-Hosted Graph Computing Frameworks: GraphScope vs Apache Giraph vs Spark GraphX Compared

Self-Hosted Headless BI: MetricFlow vs Malloy vs Cube.js for the Semantic Layer