Introduction

When working with analytical workloads, data warehouses, or large-scale data processing pipelines, the choice of storage format has an outsized impact on query performance, storage costs, and interoperability. Columnar formats — where data is stored column-by-column rather than row-by-row — enable efficient compression, predicate pushdown, and vectorized processing that row-based formats cannot match.

In this article, we compare the three major open-source columnar storage format libraries: Apache Parquet (3,057 ⭐ for the Java implementation), Apache Arrow (16,858 ⭐), and Apache ORC (766 ⭐). All three are Apache Software Foundation projects with broad industry adoption.

FeatureApache ParquetApache ArrowApache ORC
GitHub Stars3,057 (parquet-java)16,858766
Primary Use CaseCompressed on-disk storageIn-memory columnar formatHive-optimized storage
CompressionExcellent (snappy, gzip, zstd, lz4)Light (dictionary, delta)Excellent (ZLIB, snappy, LZ4, ZSTD)
Predicate PushdownFull (min/max stats, bloom filters)Column-level (via compute)Full (min/max/sum stats, bloom)
Schema EvolutionFull supportFull supportFull support
Language SupportJava, C++, Python, Go, Rust12+ languagesJava, C++, Python
EcosystemSpark, Hive, Presto, Flink, DuckDBPandas, Polars, DuckDB, DataFusionHive, Presto, Spark
Best ForLong-term storage & query enginesIn-memory analytics & interchangeHive data warehouses
LicenseApache 2.0Apache 2.0Apache 2.0

Apache Parquet: The Universal On-Disk Format

Apache Parquet was created through a collaboration between Twitter and Cloudera in 2013, designed as a self-describing, language-agnostic columnar storage format. It has become the dominant on-disk format for big data, supported by virtually every query engine and data processing framework.

Key Features

  • Columnar Compression: Snappy, GZIP, ZSTD, LZ4, and Brotli with per-column encoding selection
  • Predicate Pushdown: Row group-level statistics (min/max/null count) skip irrelevant data
  • Bloom Filters: Optional bloom filter indexes for even faster point-lookup elimination
  • Nested Data Support: Efficient encoding of complex nested structures (arrays, maps, structs)
  • Page-Level Encoding: Dictionary, RLE, delta, and bit-packing encodings per column chunk

Installation

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
# Python
pip install pyarrow  # Uses Apache Arrow's Parquet implementation

# Rust
cargo add parquet

# Java (Maven)
# <dependency>
#     <groupId>org.apache.parquet</groupId>
#     <artifactId>parquet-hadoop</artifactId>
#     <version>1.14.1</version>
# </dependency>

Basic Usage (Python)

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
import pyarrow as pa
import pyarrow.parquet as pq

# Write Parquet with compression
table = pa.table({
    'id': range(1000000),
    'value': [x * 1.5 for x in range(1000000)],
    'category': ['A', 'B', 'C', 'D'] * 250000
})

pq.write_table(table, 'data.parquet',
    compression='zstd',
    compression_level=3,
    row_group_size=100000)

# Read with predicate pushdown (reads only matching row groups)
filtered = pq.read_table('data.parquet',
    filters=[('category', '=', 'A')])
print(f"Rows: {len(filtered)}")  # 250,000

When to Choose Parquet

Parquet is the optimal choice for long-term data storage where compression ratio and query-time I/O reduction are paramount. Its deep integration with Spark, Hive, DuckDB, and Presto makes it the default format for data lake architectures. Choose Parquet when you need maximum storage efficiency with broad ecosystem compatibility.

Apache Arrow: The In-Memory Columnar Standard

Apache Arrow, initiated by Wes McKinney (creator of Pandas) in 2016, defines a language-agnostic columnar memory format optimized for analytical processing. Unlike Parquet and ORC which target on-disk storage, Arrow is designed for in-memory data representation and zero-copy interchange between systems. With 16,858 GitHub stars, it has the largest community among columnar technologies.

Key Features

  • Zero-Copy Interchange: Share data between Python, R, Java, C++, and Rust without serialization
  • Vectorized Execution: SIMD-friendly columnar layout enables cache-efficient processing
  • Gandiva Expression Compiler: LLVM-based expression evaluation for SQL-like filtering
  • Flight RPC: gRPC-based protocol for high-throughput data transfer (10-100x faster than JDBC/ODBC)
  • Compute Functions: Built-in kernels for filtering, sorting, grouping, and aggregation

Installation

1
2
3
4
5
6
7
8
9
# Python
pip install pyarrow

# C++ (CMake)
# find_package(Arrow REQUIRED)
# target_link_libraries(my_app PRIVATE arrow_shared)

# Rust
cargo add arrow

Basic Usage (Python)

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
import pyarrow as pa
import pyarrow.compute as pc

# Create Arrow table (columnar in memory)
arrays = [
    pa.array([1, 2, 3, 4, 5], type=pa.int64()),
    pa.array([10.5, 20.3, 30.1, 40.7, 50.2], type=pa.float64()),
    pa.array(['a', 'b', 'c', 'd', 'e'])
]
table = pa.Table.from_arrays(arrays, names=['id', 'value', 'label'])

# Vectorized filtering (no Python loop overhead)
mask = pc.greater(table.column('value'), 25.0)
filtered = table.filter(mask)
print(f"Filtered rows: {len(filtered)}")  # 3

# Zero-copy conversion to Pandas
df = filtered.to_pandas()

Apache Arrow Flight

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
# Server-side
import pyarrow.flight as flight

class MyFlightServer(flight.FlightServerBase):
    def do_get(self, context, ticket):
        return flight.RecordBatchStream(table)

# Client reads with zero-copy
client = flight.FlightClient('grpc://server:8815')
reader = client.do_get(flight.Ticket(b'my_data'))
table = reader.read_all()

When to Choose Arrow

Arrow is the best choice when you need high-performance in-memory analytics or data interchange between heterogeneous systems. Use Arrow when building real-time analytical applications, data science pipelines that cross language boundaries, or services that need to serve large result sets with minimal serialization overhead. Arrow is also the backbone of modern dataframe libraries like Polars and DuckDB.

Apache ORC: The Hive-Optimized Columnar Format

Apache ORC (Optimized Row Columnar) was created by Hortonworks (now part of Cloudera) in 2013 specifically for the Hive ecosystem. With 766 GitHub stars, it has a smaller community than Parquet and Arrow but remains deeply optimized for Hive-based data warehouses with unique features not available elsewhere.

Key Features

  • Stripe-Level Indexes: Min, max, sum, and count statistics per 10,000-row stripe
  • Bloom Filters: Built-in bloom filter indexes at the stripe level
  • ACID Transactions: Full ACID transaction support in Hive (INSERT, UPDATE, DELETE, MERGE)
  • Lightweight Compression: Run-length encoding optimized for sorted repetitive data
  • Type Evolution: Schema evolution with backward compatibility

Installation

1
2
3
4
5
6
7
8
9
# Python (via PyORC)
pip install pyorc

# Java (Maven)
# <dependency>
#     <groupId>org.apache.orc</groupId>
#     <artifactId>orc-core</artifactId>
#     <version>1.9.3</version>
# </dependency>

Basic Usage (Python with PyORC)

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
import pyorc

schema = "struct<id:int,name:string,score:double>"

with open('data.orc', 'wb') as f:
    with pyorc.Writer(f, schema, 
        compression=pyorc.CompressionKind.ZSTD,
        bloom_filter_columns=['id']) as writer:
        writer.write((1, 'Alice', 95.5))
        writer.write((2, 'Bob', 88.3))
        writer.write((3, 'Charlie', 92.1))

# Read with predicate pushdown
with open('data.orc', 'rb') as f:
    reader = pyorc.Reader(f)
    for row in reader:
        if row.score > 90:
            print(f"{row.name}: {row.score}")

When to Choose ORC

ORC is the best choice when your infrastructure is built around Apache Hive and you need ACID transaction support on columnar storage. Its stripe-level statistics and bloom filter indexes provide excellent query performance for Hive workloads. However, for multi-engine environments (Spark + Presto + DuckDB), Parquet offers broader compatibility.

Compression and Performance Tradeoffs

All three formats use columnar compression, but their approaches and efficiencies differ:

Compression AspectParquetArrowORC
Default AlgorithmSnappyDictionaryZLIB
Compression Ratio5-10x2-4x6-12x
Decompression SpeedFastVery Fast (direct)Moderate
Column-Level ConfigYesPer-bufferYes
Encoding VariantsDictionary, RLE, DeltaDictionary, RLERLE v1/v2, Dictionary

Parquet with ZSTD compression level 3 offers the best balance of compression ratio and decompression speed for general-purpose data lakes. ORC with ZLIB achieves the highest compression ratios but at the cost of slower reads — useful for cold archival data that’s queried infrequently.

Why Self-Host Your Columnar Data Pipeline?

Running your own columnar storage infrastructure gives you full control over data locality, access patterns, and compression strategies. Unlike managed cloud warehouses, self-hosted Parquet/Arrow/ORC stacks let you optimize for your specific query patterns without per-query pricing penalties. A well-configured local Parquet lake can serve analytical queries at a fraction of cloud warehouse costs.

For OLAP query engines that consume these formats, see our DuckDB vs Apache Doris vs Apache Pinot comparison. If you’re storing time-series data that benefits from columnar compression, our M3DB vs QuestDB vs TimescaleDB guide covers specialized time-series columnar stores. ’''

Building a Multi-Format Data Pipeline

In production environments, the best architecture often uses all three formats at different pipeline stages. Here’s a battle-tested pattern used by data engineering teams processing terabytes daily.

Ingestion Layer (ORC → Parquet): If your data originates in Hive, ingest into ORC tables for ACID transaction support and incremental updates. Periodically compact ORC partitions into Parquet files for broader query engine compatibility. Tools like Apache Hudi and Delta Lake automate this compaction while maintaining transactional guarantees.

Processing Layer (Arrow): Load Parquet files into Arrow for in-memory vectorized processing. DuckDB and Polars both use Arrow as their internal representation, enabling zero-copy querying across Parquet datasets. For streaming use cases, Arrow Flight transmits data between services with gRPC-native efficiency — eliminating the serialization bottleneck of REST/JSON APIs.

Serving Layer (Parquet ± Arrow Flight): Store final analytical datasets as Parquet with ZSTD compression (best size/speed balance) and row group sizes tuned for your query patterns. For interactive dashboards, serve via Arrow Flight to enable sub-second query responses from BI tools. For batch reporting, direct Parquet reads from Spark or Trino suffice.

1
2
3
4
5
6
7
8
9
-- DuckDB: seamless multi-format querying
SELECT 
    category,
    COUNT(*) as count,
    AVG(score) as avg_score
FROM read_parquet('analytics/*.parquet')
WHERE date >= '2026-01-01'
GROUP BY category
ORDER BY avg_score DESC;

This pipeline design gives you Hive’s transactional ingestion, Arrow’s processing speed, and Parquet’s storage efficiency — each format doing what it does best. ’''

FAQ

Should I use Parquet or Arrow for my data lake?

Use both — they’re complementary. Store data in Parquet on disk (compressed, predicate-pushdown optimized) and use Arrow in memory for processing (zero-copy, vectorized). Tools like PyArrow handle the Parquet ↔ Arrow conversion seamlessly, giving you the best of both worlds.

Does Arrow replace Parquet?

No. Arrow is an in-memory format designed for processing speed and zero-copy interchange. Parquet is an on-disk format designed for storage efficiency and query optimization. Arrow can read/write Parquet files (via pyarrow.parquet), and many query engines use Arrow in-memory while storing Parquet on disk.

Can ORC files be used outside of Hive?

Yes, but with caveats. ORC readers exist for Python (pyorc), Presto, Spark, and Flink. However, the development velocity and cross-engine testing for ORC is lower than Parquet. For greenfield projects that aren’t Hive-centric, Parquet is generally the more portable choice.

How do columnar formats compare to row-based formats like Avro?

Columnar formats (Parquet, ORC) excel at analytical queries that read a subset of columns (e.g., SELECT AVG(price) FROM sales). Row-based formats (Avro, JSON Lines) are better for transactional workloads that read entire records. Use columnar for analytics, row-based for event streaming and message queues.

What about Apache Iceberg and Delta Lake?

Iceberg and Delta Lake are table formats (metadata layers) that sit on top of storage formats like Parquet and ORC. They add ACID transactions, time travel, and schema evolution at the table level. Both use Parquet as their default underlying storage format. Choose a table format for production data lakes requiring transactional guarantees.

How do I migrate existing CSV/JSON data to Parquet?

The simplest approach is using DuckDB or PyArrow:

1
2
-- DuckDB: one-liner conversion
COPY (SELECT * FROM read_csv_auto('data.csv')) TO 'data.parquet' (FORMAT PARQUET, COMPRESSION ZSTD);

For large-scale migrations, Apache Spark or PyArrow with row group batching provides optimal throughput.


💰 想测试你的市场判断力?我用 Polymarket 做预测市场交易——这是全球最大的预测市场平台,从大选结果到技术监管时间线,什么都可以押注。和赌博不同,这是真正的信息市场:你懂的信息越多,胜率越高。我靠预测技术相关事件的走向已经赚了不少。用我的邀请链接注册:Polymarket.com