Chapter 10: Batch Processing

Core Concepts

Batch Processing Definition

Process large amounts of data in bulk
Input is a bounded dataset (known, finite size)
Output is a new dataset or a report
No user interaction during processing

Unix Philosophy for Batch Processing

Simple tools that do one thing well
Compose tools via pipes, using a uniform interface (text on stdin and stdout)
Handle failures gracefully

Batch Processing with Unix Tools

Simple Log Analysis

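awk '{print $7}' server.log |  # extract the requested URL path (7th field)
  sort |                       # bring identical paths together
  uniq -c |                    # count occurrences of each path
  sort -rn |                   # order by count, highest first
  head -n 10                   # keep the 10 most requested paths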
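
The counting part of the pipeline above can also be sketched in Python. This is a minimal sketch that assumes, as the awk step does, that the requested path is the seventh whitespace-separated field; the function name `top_paths` and the sample log lines are hypothetical.

```python
from collections import Counter

def top_paths(lines, n=10):
    """Count requests per URL path (7th whitespace-separated field)."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) >= 7:          # skip malformed lines
            counts[fields[6]] += 1    # awk's $7 is index 6 in Python
    return counts.most_common(n)

# Hypothetical log lines in a common access-log layout.
sample = [
    '1.2.3.4 - - [27/Feb/2015:17:55:11 +0000] "GET /index.html HTTP/1.1" 200 512',
    '1.2.3.5 - - [27/Feb/2015:17:55:12 +0000] "GET /about.html HTTP/1.1" 200 128',
    '1.2.3.4 - - [27/Feb/2015:17:55:13 +0000] "GET /index.html HTTP/1.1" 200 512',
]
print(top_paths(sample))
```

This version keeps all counts in an in-memory hash table, whereas the `sort | uniq -c` approach relies on sorting, which can spill to disk when the data is larger than memory.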

Limitations of Unix Tools

Stream through records sequentially, one at a time
No indexes
Limited to a single machine

MapReduce

Concept

Distributed batch processing framework
Write map and reduce functions
Framework handles distribution, fault tolerance

Map Function

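def map_fn(key, value):
    # key: record identifier (unused); value: one line of text
    for word in value.split():
        yield (word, 1)  # emit one (word, 1) pair per occurrence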

Reduce Function

def reduce_fn(key, values):
    # values: every count emitted for this word across all mappers
    yield (key, sum(values))

Execution Model

Map phase: Parallel across input splits
Shuffle phase: Partition and sort mapper output so all values for a key are grouped together
Reduce phase: Parallel across key partitions
Output: Written to distributed filesystem
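
The three phases can be imitated on a single machine. The sketch below is an in-memory toy, not a real framework: `run_job` is a hypothetical driver, and the shuffle is a plain dictionary rather than a distributed sort.

```python
from collections import defaultdict

def map_fn(key, value):
    """Emit (word, 1) for every word in one input record."""
    for word in value.split():
        yield word, 1

def reduce_fn(key, values):
    """Sum all counts collected for one word."""
    yield key, sum(values)

def run_job(records, map_fn, reduce_fn):
    # Map phase: each record is processed independently.
    mapped = [pair for k, v in records for pair in map_fn(k, v)]
    # Shuffle phase: group values by key (a real framework sorts and
    # partitions across machines; here it is one in-memory dict).
    groups = defaultdict(list)
    for k, v in mapped:
        groups[k].append(v)
    # Reduce phase: each key is reduced independently, in sorted key order.
    return [out for k in sorted(groups) for out in reduce_fn(k, groups[k])]

records = [(0, "the quick fox"), (1, "the lazy dog"), (2, "the end")]
result = run_job(records, map_fn, reduce_fn)
print(result)
```

Because the map and reduce functions only see their own arguments, the driver is free to run them in any order or in parallel, which is what lets a framework distribute them.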

Fault Tolerance

Task retry: Failed tasks are restarted, possibly on another machine
Speculative execution: Start backup copies of slow tasks (stragglers)
Task tracking: Coordinator monitors progress
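
Task retry can be sketched as a loop around a deterministic task. This is a simplified, single-process illustration; `run_with_retries` and `flaky_task` are hypothetical names, and a real coordinator would reschedule the task on a different machine.

```python
def run_with_retries(task, max_attempts=3):
    """Re-run a deterministic task until it succeeds or attempts run out."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return task()
        except Exception as exc:   # output of the failed attempt is discarded
            last_error = exc
    raise RuntimeError(f"task failed after {max_attempts} attempts") from last_error

# Hypothetical task that fails twice before succeeding.
attempts = {"n": 0}

def flaky_task():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise IOError("simulated machine failure")
    return sum(range(10))

result = run_with_retries(flaky_task)
print(result, "after", attempts["n"], "attempts")
```

Retrying is only safe because the task has no side effects other than its return value, so a repeated run produces the same output as a single successful run.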

Joins in MapReduce

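Map-Side Merge Join:

Both inputs already partitioned and sorted by the join key in the same way
Mapper merges matching partitions during the map phase
Efficient for pre-sorted data (no shuffle needed)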
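
A merge join over sorted inputs can be sketched as a single forward pass. The sketch assumes both inputs are sorted by the join key and that the `users` side has at most one record per key; the record layouts and the name `merge_join` are hypothetical.

```python
def merge_join(users, events):
    """Inner join of two inputs that are both sorted by user_id.

    users:  sorted (user_id, name) pairs, one per user_id
    events: sorted (user_id, action) pairs, possibly several per user_id
    """
    users = iter(users)
    user = next(users, None)
    for user_id, action in events:
        # Advance the users side until it catches up with the event key.
        while user is not None and user[0] < user_id:
            user = next(users, None)
        if user is not None and user[0] == user_id:
            yield user_id, user[1], action

users = [(1, "alice"), (2, "bob"), (4, "dana")]
events = [(1, "login"), (1, "click"), (3, "login"), (4, "logout")]
joined = list(merge_join(users, events))
print(joined)
```

Each input is read once, front to back, so neither side has to fit in memory.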

Reduce-Side Join:

Both inputs shuffled by join key
Reduce function joins records
More general but slower (all records are sorted and copied over the network)
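
A reduce-side join can be sketched by tagging each record with its source in the mapper and pairing the two sides in the reducer. This is a single-process toy; the function names and record layouts are hypothetical, and the dictionary stands in for the shuffle.

```python
from collections import defaultdict

def map_users(record):
    user_id, name = record
    yield user_id, ("user", name)        # tag the record with its source

def map_events(record):
    user_id, action = record
    yield user_id, ("event", action)

def reduce_join(user_id, tagged_values):
    """Pair every user record with every event record for one key."""
    names = [v for tag, v in tagged_values if tag == "user"]
    actions = [v for tag, v in tagged_values if tag == "event"]
    for name in names:
        for action in actions:
            yield user_id, name, action

def reduce_side_join(users, events):
    groups = defaultdict(list)           # stand-in for the shuffle
    for record in users:
        for key, value in map_users(record):
            groups[key].append(value)
    for record in events:
        for key, value in map_events(record):
            groups[key].append(value)
    return [row for key in sorted(groups) for row in reduce_join(key, groups[key])]

users = [(2, "bob"), (1, "alice")]       # inputs need not be sorted
events = [(1, "login"), (3, "login"), (1, "click"), (2, "view")]
joined = reduce_side_join(users, events)
print(joined)
```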

Broadcast Join:

Small dataset fits in memory
A copy is sent to every mapper
Each mapper joins via in-memory hash table lookups
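
A broadcast join can be sketched as a hash table lookup inside each mapper. The sketch assumes the small side fits in memory; `broadcast_hash_join` and the sample records are hypothetical.

```python
def broadcast_hash_join(small_table, large_partition):
    """Join one partition of the large input against an in-memory copy
    of the small input."""
    lookup = dict(small_table)           # built once per mapper
    for user_id, action in large_partition:
        name = lookup.get(user_id)
        if name is not None:             # inner join: drop unmatched events
            yield user_id, name, action

users = [(1, "alice"), (2, "bob")]
partition = [(2, "view"), (1, "login"), (3, "login"), (1, "click")]
joined = list(broadcast_hash_join(users, partition))
print(joined)
```

The large input is processed in whatever order it arrives, so no sorting or shuffling of the large side is required.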

Limitations

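Multi-stage workflows need multiple passes over the data
Intermediate results are written to the distributed filesystem between jobs (high disk I/O)
Complex workflows are hard to manage as chains of separate jobs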

Beyond MapReduce

Dataflow Engines

Apache Spark
Apache Flink
Apache Tez

Advantages:

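Intermediate state kept in memory or on local disk instead of the distributed filesystem
Whole workflow executed as one DAG of operators
Usually faster than the equivalent chain of MapReduce jobs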
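
The idea of passing records from one operator to the next without writing intermediate results can be illustrated with Python generators. This is plain Python, not the API of Spark, Flink, or Tez; the operator names and the `lines_read` bookkeeping list are hypothetical.

```python
from itertools import islice

lines_read = []                        # records how far the source was consumed

def read_lines(lines):
    for line in lines:
        lines_read.append(line)
        yield line

def split_words(lines):
    for line in lines:
        yield from line.lower().split()

def keep_long(words, min_len=4):
    for word in words:
        if len(word) >= min_len:
            yield word

data = ["Batch jobs read bounded input", "and write derived output"]

# Operators are chained; nothing runs until the result is consumed.
pipeline = keep_long(split_words(read_lines(data)))
first_two = list(islice(pipeline, 2))
print(first_two, "lines read so far:", len(lines_read))
```

Each operator starts work as soon as its input produces a record, so the second line is never read when only the first two results are requested.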

High-Level APIs

SQL interfaces
Declarative queries
Automatic optimization
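
To illustrate the declarative style, the sketch below runs a top-paths query through Python's built-in `sqlite3` module. SQLite is only a single-machine stand-in here, not a distributed engine; the table name, column names, and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE requests (path TEXT)")
conn.executemany(
    "INSERT INTO requests VALUES (?)",
    [("/index.html",), ("/about.html",), ("/index.html",)],
)

# The query states what result is wanted; the engine decides how to
# scan, group, and sort.
rows = conn.execute(
    """
    SELECT path, COUNT(*) AS hits
    FROM requests
    GROUP BY path
    ORDER BY hits DESC, path
    LIMIT 10
    """
).fetchall()
conn.close()
print(rows)
```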

Specialization

Graph processing (Pregel, Giraph)
Machine learning (Mahout, MLlib)
Stream processing

The Output of Batch Workflows

Key-Value Stores

Batch job produces key-value data, written as complete new files
Loaded into a store that serves joins and lookups by key
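
One way to hand batch output to a serving system is to write a complete new file and then swap it into place. The sketch below assumes a JSON file as a stand-in for a real key-value store format; `publish_snapshot`, `load_snapshot`, and the file name are hypothetical.

```python
import json
import os
import tempfile

def publish_snapshot(pairs, target_path):
    """Write all key-value pairs to a new file, then swap it into place."""
    directory = os.path.dirname(target_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(dict(pairs), f)
    # Rename over the old file; on POSIX this is atomic when both paths
    # are on the same filesystem, so readers see old or new, never half.
    os.replace(tmp_path, target_path)

def load_snapshot(path):
    with open(path) as f:
        return json.load(f)

with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "recommendations.json")
    publish_snapshot([("user1", ["a", "b"])], path)                  # first run
    publish_snapshot([("user1", ["c"]), ("user2", ["a"])], path)     # re-run
    snapshot = load_snapshot(path)

print(snapshot)
```

Because each run replaces the whole file, a bad run can be undone by publishing the previous output again.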

Search Indexes

Batch job builds search index
Inverted index from documents
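
Building an inverted index can be sketched in a few lines. This minimal version assumes whitespace tokenization and lowercase terms; `build_inverted_index` and the sample documents are hypothetical.

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map each term to the sorted list of document ids containing it."""
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

docs = {
    1: "batch processing with MapReduce",
    2: "stream processing",
    3: "batch jobs",
}
index = build_inverted_index(docs)
print(index["batch"], index["processing"])
```

In MapReduce terms, the mapper would emit (term, doc_id) pairs and the reducer would collect the document ids for each term.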

Materialized Views

Pre-computed query results
Updated periodically

Philosophy of Batch Process Outputs

Reprocessability

Keep raw input data unchanged
Re-run pipeline if needed (for example after fixing a bug in the code)
Makes debugging and auditing easier

Deriving Several Views

Multiple outputs from same input
Event log → multiple derived datasets
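
Deriving several datasets from one event log can be sketched with two pure functions over the same input. The event fields and function names below are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical event log; it is only read, never modified.
events = [
    {"user": "alice", "action": "view", "item": "book"},
    {"user": "bob", "action": "view", "item": "book"},
    {"user": "alice", "action": "buy", "item": "book"},
    {"user": "bob", "action": "view", "item": "lamp"},
]

def views_per_item(events):
    """Derived view 1: how often each item was viewed."""
    return Counter(e["item"] for e in events if e["action"] == "view")

def purchases_by_user(events):
    """Derived view 2: items bought by each user."""
    bought = defaultdict(list)
    for e in events:
        if e["action"] == "buy":
            bought[e["user"]].append(e["item"])
    return dict(bought)

print(views_per_item(events))
print(purchases_by_user(events))
```

Since neither function changes the log, either view can be discarded and recomputed at any time, and new views can be added later without touching the existing ones.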

Key Takeaways

MapReduce pioneered distributed batch processing
Dataflow engines (Spark, Flink) are usually faster because they avoid materializing intermediate state
Joins are expensive but necessary for data integration
Reprocessability is a key advantage of batch processing
Derived data can be recomputed from the source data
