Processing large-scale data in batch mode remains a foundational requirement for data engineering pipelines. Whether you are running ETL jobs, building data warehouses, training machine learning models, or generating nightly reports, choosing the right batch processing engine impacts cost, performance, and operational complexity.
This guide compares three major open-source batch processing frameworks: Apache Spark, Apache Hadoop MapReduce, and Apache Tez — covering architecture, performance, deployment, and when to use each.
Comparison Table
| Feature | Apache Spark | Hadoop MapReduce | Apache Tez |
|---|---|---|---|
| Stars | 43,200+ | 15,500+ | 510+ |
| Processing Model | In-memory DAG execution | Disk-based map-reduce stages | DAG execution engine on YARN |
| Latency | Seconds to minutes (in-memory) | Minutes to hours (disk I/O) | Seconds to minutes (in-memory) |
| APIs | Scala, Java, Python, R, SQL | Java, Streaming (any language) | Java (via Hive, Pig, or custom) |
| Execution | Standalone, YARN, Kubernetes, Mesos | YARN only | YARN only |
| Fault Tolerance | RDD lineage reconstruction | Task re-execution from disk | DAG re-execution |
| ML Support | MLlib (built-in) | Mahout (external) | None (execution engine only) |
| Streaming | Structured Streaming (micro-batch) | Storm (separate project) | None (batch only) |
| Docker Image | bitnami/spark, apache/spark | bitnami/hadoop | No official image |
| Last Active | 2026 | 2026 | 2026 |
| Language | Scala | Java | Java |
Apache Spark: The In-Memory Processing Engine
Apache Spark is the dominant batch processing framework in modern data engineering. Its in-memory execution model makes it 10-100x faster than MapReduce for iterative workloads. Spark provides a unified engine for batch processing, streaming, SQL queries, machine learning, and graph computation.
Architecture
Spark follows a driver-executor model:
- Driver: Orchestrates job execution, maintains DAG, schedules tasks
- Executors: Run tasks on cluster nodes, cache data in memory
- Cluster Manager: YARN, Kubernetes, or Spark Standalone
Docker Compose Deployment
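A minimal Compose file for a single-master, single-worker cluster, using the bitnami/spark image from the comparison table. Port numbers follow Spark defaults; the worker's memory and core values are illustrative and should be sized to your hardware.

```yaml
version: "3.8"
services:
  spark-master:
    image: bitnami/spark:latest
    environment:
      - SPARK_MODE=master
    ports:
      - "8080:8080"   # master web UI
      - "7077:7077"   # cluster manager port workers connect to
  spark-worker:
    image: bitnami/spark:latest
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark-master:7077
      - SPARK_WORKER_MEMORY=4G   # illustrative; size to your host
      - SPARK_WORKER_CORES=2
    depends_on:
      - spark-master
```

Scale out by adding more worker services (or `docker compose up --scale spark-worker=3`), and check the master UI at `http://localhost:8080` to confirm workers registered.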
Running a PySpark Job
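A sketch of a classic word-count batch job with the DataFrame API (file paths and the output location are illustrative). The script assumes the `pyspark` package and a reachable cluster or local JVM; the master is supplied at submit time rather than hard-coded.

```python
# wordcount.py -- a minimal PySpark batch job (requires the pyspark package)
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("wordcount")   # name shown in the Spark UI
    .getOrCreate()          # master/resources come from spark-submit
)

# Read a text file, split lines into words, count occurrences per word
lines = spark.read.text("hdfs:///data/input.txt")   # illustrative path
counts = (
    lines.select(F.explode(F.split(F.col("value"), r"\s+")).alias("word"))
    .filter(F.col("word") != "")
    .groupBy("word")
    .count()
    .orderBy(F.col("count").desc())
)

# Write results as Parquet; "overwrite" makes nightly reruns idempotent
counts.write.mode("overwrite").parquet("hdfs:///data/wordcounts")
spark.stop()
```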
Spark Submit Command
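Submitting the job above to YARN might look like this; the resource values are illustrative starting points, not tuned numbers. On a standalone master you would pass `--master spark://spark-master:7077` instead (where `--num-executors` is ignored; use `--total-executor-cores`), or `--master local[*]` for a single-node test.

```bash
# Submit wordcount.py to YARN; resource flags are illustrative starting points
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 4 \
  --executor-cores 2 \
  --executor-memory 4g \
  --conf spark.sql.shuffle.partitions=200 \
  wordcount.py
```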
Hadoop MapReduce: The Original Batch Engine
Hadoop MapReduce pioneered distributed batch processing and established the paradigm of splitting computation into map and reduce phases. While it has been largely superseded by Spark for most workloads, MapReduce remains relevant for specific use cases.
Architecture
MapReduce runs on top of HDFS (Hadoop Distributed File System) and YARN (Yet Another Resource Negotiator):
- NameNode: Manages HDFS metadata
- DataNode: Stores actual data blocks
- ResourceManager: Allocates cluster resources
- NodeManager: Manages resources on individual nodes
- ApplicationMaster: Negotiates resources for a specific job
Docker Compose with Hadoop
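A sketch of the daemon layout as Compose services, one per role from the architecture list above. The image name comes from the comparison table; entrypoints and environment variables differ between Hadoop images, and a real deployment also needs `core-site.xml`/`hdfs-site.xml` mounted into each container, so treat this as a structural outline and consult the image's documentation.

```yaml
version: "3.8"
services:
  namenode:
    image: bitnami/hadoop:latest     # image name from the comparison table
    command: ["hdfs", "namenode"]    # entrypoint varies by image; check its docs
    ports:
      - "9870:9870"                  # NameNode web UI (Hadoop 3.x default)
  datanode:
    image: bitnami/hadoop:latest
    command: ["hdfs", "datanode"]
    depends_on:
      - namenode
  resourcemanager:
    image: bitnami/hadoop:latest
    command: ["yarn", "resourcemanager"]
    ports:
      - "8088:8088"                  # YARN ResourceManager UI
  nodemanager:
    image: bitnami/hadoop:latest
    command: ["yarn", "nodemanager"]
    depends_on:
      - resourcemanager
```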
Running a MapReduce Job
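The standard workflow is to stage input in HDFS, run a jar with `hadoop jar`, and read the reducer output. The example below uses the WordCount job from the examples jar that ships with Hadoop; the jar's exact filename varies by version, and the file and directory names are illustrative.

```bash
# Stage input data in HDFS (paths are illustrative)
hdfs dfs -mkdir -p /user/batch/input
hdfs dfs -put ./access.log /user/batch/input/

# Run the bundled WordCount example; the jar filename depends on your version
hadoop jar "$HADOOP_HOME"/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
  wordcount /user/batch/input /user/batch/output

# Reducer output lands in part-r-* files; the output dir must not pre-exist
hdfs dfs -cat /user/batch/output/part-r-00000 | head
```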
When MapReduce Still Makes Sense
- Massive datasets on limited RAM: MapReduce writes intermediate results to disk, making it suitable for datasets that exceed cluster memory
- Compliance and auditability: Disk-based processing provides a natural audit trail of intermediate results
- Existing Hadoop ecosystem: Organizations with heavy investments in HDFS, Hive, and HBase may prefer MapReduce for consistency
- Simple ETL jobs: For straightforward map-filter-reduce operations, MapReduce’s simplicity can be an advantage
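The map-shuffle-reduce pattern these engines distribute can be sketched in plain Python. This is a single-process toy to show the three phases, not MapReduce itself; in the real framework the map and reduce functions run on different nodes and the shuffle moves data over the network.

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    # Map: emit a (word, 1) pair for each word in a line
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle: group values by key, as the framework does between map and reduce
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts for each key
    return {key: sum(values) for key, values in groups.items()}

lines = ["spark and tez", "tez and mapreduce", "spark spark"]
mapped = chain.from_iterable(map_phase(line) for line in lines)
counts = reduce_phase(shuffle(mapped))
print(counts)  # {'spark': 3, 'and': 2, 'tez': 2, 'mapreduce': 1}
```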
Apache Tez: The DAG Execution Engine
Apache Tez is a DAG (Directed Acyclic Graph) execution engine that runs on YARN. Unlike MapReduce’s rigid two-stage model, Tez allows arbitrary computation graphs, enabling optimizations like joining multiple operators into a single task.
Architecture
Tez sits between the application (Hive, Pig, or custom code) and YARN:
- DAG API: Applications define computation as a directed acyclic graph
- Tez ApplicationMaster: Manages DAG execution on YARN
- Task Schedulers: Optimize task placement based on data locality
Running Tez with Hive
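Hive selects its execution engine through the real `hive.execution.engine` property, which can be set per session or globally in `hive-site.xml`. The table and columns in the query below are illustrative; the point is that a multi-stage aggregation like this runs as one Tez DAG rather than a chain of separate MapReduce jobs.

```sql
-- Switch this session's execution engine to Tez
SET hive.execution.engine=tez;

-- Runs as a single Tez DAG instead of chained map-reduce jobs
SELECT region,
       COUNT(*)    AS orders,
       SUM(amount) AS revenue
FROM sales
GROUP BY region
ORDER BY revenue DESC;
```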
Tez vs MapReduce Performance
Tez typically outperforms MapReduce by 2-10x on equivalent workloads because:
- Fewer I/O operations: Multiple map-reduce stages are collapsed into a single DAG
- Better resource utilization: Containers are reused across tasks instead of being allocated/deallocated per stage
- Dynamic optimization: Tez can re-plan the DAG at runtime based on intermediate result sizes
Choosing the Right Batch Processing Engine
Choose Apache Spark when:
- You need the fastest possible batch processing with in-memory execution
- Your team uses Python, Scala, or R (PySpark, SparkR)
- You want a unified platform for batch, streaming, ML, and SQL
- You need Kubernetes or standalone deployment (not just YARN)
- You are building modern data pipelines with Parquet/Delta Lake
Choose Hadoop MapReduce when:
- You have petabyte-scale datasets that exceed available cluster memory
- You need maximum fault tolerance with disk-backed intermediate results
- Your organization has an existing Hadoop ecosystem (HDFS, Hive, HBase)
- You process simple map-filter-reduce workloads without complex DAGs
Choose Apache Tez when:
- You are running Hive or Pig queries on YARN and want better performance than MapReduce
- You need DAG optimization without adopting a full Spark deployment
- Your workloads benefit from container reuse across computation stages
- You want incremental performance improvements within an existing Hadoop stack
Why Self-Host Batch Processing?
Running batch processing engines on self-hosted infrastructure provides significant advantages over cloud-managed alternatives:
- Cost predictability: Cloud Spark (Databricks, EMR) charges per DPU/hour, which can become expensive for nightly batch jobs processing terabytes. Self-hosted Spark on commodity hardware runs at a fixed cost
- Data sovereignty: Batch processing often involves sensitive data (financial records, healthcare data, PII). Keeping computation on-premises avoids data transfer to cloud regions with different privacy regulations
- Network performance: Processing data where it lives eliminates the cost and latency of moving terabytes to cloud storage for computation. Self-hosted clusters co-located with data sources (databases, data lakes, IoT pipelines) minimize data movement
- Custom hardware acceleration: Self-hosted clusters can use GPUs for ML workloads (Spark MLlib with RAPIDS), NVMe storage for shuffle operations, or high-bandwidth NICs for data-intensive stages
- No vendor lock-in: Open-source Spark, MapReduce, and Tez run identically on any infrastructure. You avoid proprietary extensions from Databricks, AWS EMR, or Google Dataproc that make migration difficult
For data pipeline orchestration, see our Apache Airflow vs Dagster vs Prefect guide. If you need workflow orchestration with DAG scheduling, check our Dagu vs Netflix Conductor vs Airflow comparison. For analytics databases, our ClickHouse vs Druid vs Pinot comparison covers real-time query engines.
FAQ
Is Apache Spark faster than Hadoop MapReduce?
Yes, Spark is typically 10-100x faster than MapReduce for most workloads because it processes data in memory rather than writing intermediate results to disk after each map and reduce stage. The speed advantage is most pronounced for iterative algorithms (machine learning, graph processing) where the same data is accessed multiple times.
Can Spark run on existing Hadoop clusters?
Yes. Spark can run on YARN (Hadoop’s resource manager) and read data from HDFS. This means you can deploy Spark on an existing Hadoop cluster without replacing HDFS. Many organizations run both MapReduce and Spark on the same YARN cluster, using MapReduce for legacy jobs and Spark for new workloads.
What is the difference between Apache Tez and Apache Spark?
Tez is a DAG execution engine that runs on YARN and is primarily used as a backend for Hive and Pig. Spark is a standalone processing engine with its own cluster manager, APIs, and ecosystem (MLlib, Structured Streaming, Spark SQL). Tez improves MapReduce performance within the Hadoop ecosystem; Spark replaces MapReduce entirely with a different processing model.
Does Apache Tez support Python?
Not natively. Tez is a Java-based execution engine. Python support comes through higher-level tools that use Tez as a backend — for example, Hive queries written in SQL can run on Tez. If you need a Python-first batch processing framework, Apache Spark with PySpark is the better choice.
How much memory does Spark need?
Spark’s memory requirements depend on your data size and transformations. A good starting point is 4-8 GB of executor memory per core. Spark needs enough memory to cache RDDs/DataFrames and perform shuffle operations. For a 1 TB dataset, a cluster with 10 executors at 8 GB each (80 GB total) is a reasonable starting point.
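Note that the heap you request with `--executor-memory` is not the whole container footprint: under YARN and Kubernetes, Spark also requests per-executor overhead (by default roughly the larger of 10% of executor memory or 384 MB, via `spark.executor.memoryOverhead`). A back-of-envelope check for the 10 × 8 GB example above:

```python
executors = 10
executor_memory_gb = 8        # --executor-memory 8g
overhead_fraction = 0.10      # default overhead: max(10% of heap, 384 MB)

heap_total_gb = executors * executor_memory_gb
overhead_total_gb = executors * max(executor_memory_gb * overhead_fraction, 0.384)
container_total_gb = heap_total_gb + overhead_total_gb

print(heap_total_gb)                  # 80 GB of executor heap
print(round(container_total_gb, 1))   # 88.0 GB requested from the cluster manager
```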
Can I run Spark without Hadoop?
Yes. Spark can run in standalone mode, on Kubernetes, or on Mesos without any Hadoop components. It can also read from cloud storage (S3, GCS, Azure Blob), local filesystems, or databases directly. Hadoop (HDFS + YARN) is only needed if you want Spark to run on a Hadoop cluster.