Chapter 10: Batch Processing
Core Concepts
Batch Processing Definition
- Process large amounts of data in bulk
- Input is a bounded dataset (known, finite size)
- Output is a new derived dataset or a report
- No user interaction during processing
Unix Philosophy for Batch Processing
- Simple tools that do one thing well
- Compose tools via pipes
- Handle failures gracefully
Batch Processing with Unix Tools
Simple Log Analysis
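cat server.log |              # read the log
grep "GET /index.html" |      # keep only requests for /index.html
awk '{print $7}' |            # print field 7 (the request path, assuming the common log format)
sort |                        # bring identical values together
uniq -c |                     # count each distinct value
sort -rn |                    # order by count, highest first
head -n 10                    # keep the top 10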
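For comparison, the same analysis can be sketched as a small Python program. This is only an illustration: the function name `top_paths` and the sample log lines are made up, and the field position assumes whitespace-separated lines in the common log format.

```python
from collections import Counter

def top_paths(lines, needle="GET /index.html", n=10):
    """Count field 7 of each matching line and return the n most common values."""
    counts = Counter()
    for line in lines:
        if needle not in line:          # same role as: grep "GET /index.html"
            continue
        fields = line.split()           # awk also splits on whitespace
        if len(fields) >= 7:
            counts[fields[6]] += 1      # awk's $7 is index 6 in Python
    return counts.most_common(n)        # same role as: sort -rn | head -n 10

# Hypothetical sample data in the common log format.
sample = [
    '1.2.3.4 - - [01/Jan/2024:00:00:01 +0000] "GET /index.html HTTP/1.1" 200 512',
    '1.2.3.5 - - [01/Jan/2024:00:00:02 +0000] "GET /index.html?ref=a HTTP/1.1" 200 512',
    '1.2.3.4 - - [01/Jan/2024:00:00:03 +0000] "GET /index.html HTTP/1.1" 200 512',
    '1.2.3.6 - - [01/Jan/2024:00:00:04 +0000] "GET /about.html HTTP/1.1" 200 128',
]
print(top_paths(sample))
```

The design difference is worth noting: this version keeps an in-memory hash table of counts, while the pipeline relies on `sort`, which can spill to disk when the data does not fit in memory.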
Limitations of Unix Tools
- Only process one record at a time
- No indexes
- Limited to single machine
MapReduce
Concept
- Distributed batch processing framework
- Write map and reduce functions
- Framework handles distribution, fault tolerance
Map Function
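def map(key, value):
    # key: record identifier (unused here); value: one line of text
    for word in value.split():
        emit(word, 1)  # emit() is supplied by the framework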
Reduce Function
def reduce(key, values):
    # values: every count emitted for this key across all mappers
    emit(key, sum(values))
Execution Model
- Map phase: Parallel across input splits
- Shuffle phase: Partition and sort mapper output so all values for a key reach the same reducer
- Reduce phase: Parallel across keys
- Output: Written to distributed filesystem
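The three phases can be simulated in a single process to see how the pieces fit together. This is a minimal sketch, not how a real framework is implemented: `run_job`, `map_fn`, and `reduce_fn` are hypothetical names, generators stand in for the framework's emit call, and everything runs sequentially on one machine.

```python
from collections import defaultdict

def map_fn(key, value):
    # Yield one (word, 1) pair per word; the key (record ID) is unused.
    for word in value.split():
        yield word, 1

def reduce_fn(key, values):
    # Sum all counts collected for one word.
    yield key, sum(values)

def run_job(records, map_fn, reduce_fn):
    # Map phase: a real framework runs one map task per input split, in parallel.
    mapped = []
    for key, value in records:
        mapped.extend(map_fn(key, value))

    # Shuffle phase: group every value under its key.
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)

    # Reduce phase: a real framework spreads keys across reduce tasks.
    output = []
    for key in sorted(groups):
        output.extend(reduce_fn(key, groups[key]))
    return output

records = [(0, "the quick fox"), (1, "the lazy dog the end")]
print(run_job(records, map_fn, reduce_fn))
```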
Fault Tolerance
- Task retry: Failed tasks restarted
- Speculative execution: Start backup copies of slow (straggler) tasks and use whichever finishes first
- Task tracking: Coordinator monitors progress
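Task retry can be sketched as a loop around a task function. The names `run_with_retries` and `flaky_task` are hypothetical, and the sketch assumes tasks are deterministic and read immutable input, which is what makes re-running them safe.

```python
def run_with_retries(task, max_attempts=3):
    """Run task(attempt) until it succeeds or max_attempts is reached."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return task(attempt)        # output of failed attempts is discarded
        except Exception as exc:        # a real coordinator would also detect timeouts
            last_error = exc
    raise RuntimeError(f"task failed after {max_attempts} attempts") from last_error

def flaky_task(attempt):
    # Simulated task: the first two attempts crash, the third succeeds.
    if attempt < 3:
        raise IOError("simulated worker crash")
    return "done"

print(run_with_retries(flaky_task))
```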
Joins in MapReduce
Map-Side Join:
- Both inputs already partitioned and sorted by the join key
- Merge during map phase
- Efficient for pre-sorted data
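A merge join over pre-sorted inputs can be sketched as follows. `merge_join` is a hypothetical helper, and it assumes both inputs are lists of (key, value) pairs already sorted by key, as they would be within one matching pair of partitions.

```python
from itertools import groupby
from operator import itemgetter

def merge_join(left, right):
    """Inner-join two (key, value) sequences that are both sorted by key."""
    left_groups = groupby(left, key=itemgetter(0))
    right_groups = groupby(right, key=itemgetter(0))
    lk, lg = next(left_groups, (None, None))
    rk, rg = next(right_groups, (None, None))
    while lg is not None and rg is not None:
        if lk < rk:
            lk, lg = next(left_groups, (None, None))    # advance the smaller side
        elif lk > rk:
            rk, rg = next(right_groups, (None, None))
        else:
            # Keys match: pair every left value with every right value.
            right_values = [value for _, value in rg]
            for _, left_value in lg:
                for right_value in right_values:
                    yield lk, left_value, right_value
            lk, lg = next(left_groups, (None, None))
            rk, rg = next(right_groups, (None, None))

users = [(1, "alice"), (2, "bob"), (4, "dana")]
events = [(1, "login"), (1, "click"), (3, "view"), (4, "logout")]
print(list(merge_join(users, events)))
```

Each input is read once, front to back, so no shuffle or in-memory hash table is needed.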
Reduce-Side Join:
- Both inputs shuffled by join key
- Reduce function joins records
- More general but slower
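A reduce-side join can be sketched by tagging each record with its source before grouping. All names here (`map_users`, `map_events`, `reduce_join`, `reduce_side_join`) and the record fields are hypothetical, and the shuffle is simulated with an in-memory dictionary.

```python
from collections import defaultdict

def map_users(user):
    # Tag each record with its source so the reducer can tell them apart.
    yield user["user_id"], ("user", user)

def map_events(event):
    yield event["user_id"], ("event", event)

def reduce_join(key, tagged_records):
    users = [rec for tag, rec in tagged_records if tag == "user"]
    events = [rec for tag, rec in tagged_records if tag == "event"]
    for user in users:                  # inner join: needs both sides present
        for event in events:
            yield {"user_id": key, "name": user["name"], "url": event["url"]}

def reduce_side_join(users, events):
    groups = defaultdict(list)          # simulated shuffle by join key
    for user in users:
        for key, tagged in map_users(user):
            groups[key].append(tagged)
    for event in events:
        for key, tagged in map_events(event):
            groups[key].append(tagged)
    joined = []
    for key in sorted(groups):
        joined.extend(reduce_join(key, groups[key]))
    return joined

users = [{"user_id": 1, "name": "alice"}, {"user_id": 2, "name": "bob"}]
events = [
    {"user_id": 1, "url": "/home"},
    {"user_id": 1, "url": "/cart"},
    {"user_id": 3, "url": "/home"},
]
print(reduce_side_join(users, events))
```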
Broadcast Join:
- Small dataset fits in memory
- Send to all mappers
- Join in memory
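A broadcast join can be sketched as a hash lookup performed inside each mapper. `broadcast_join` is a hypothetical name, and the sketch assumes the small dataset has unique keys and fits comfortably in one mapper's memory.

```python
def broadcast_join(small, large_partition):
    """Join one partition of the large input against an in-memory copy of the small input."""
    lookup = dict(small)                # every mapper loads the full small dataset
    for key, value in large_partition:
        if key in lookup:               # inner join: drop records with no match
            yield key, lookup[key], value

users = [(1, "alice"), (2, "bob")]                          # small side
events_partition = [(2, "view"), (1, "click"), (3, "view"), (2, "buy")]
print(list(broadcast_join(users, events_partition)))
```

The large input needs no sorting or shuffling, which is why this approach is attractive when one side is small.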
Limitations
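- Multi-stage workflows need several chained jobs, each making its own pass over the data
- High disk I/O: intermediate results between jobs are written to the distributed filesystem
- Complex workflows are hard to manage without a separate scheduler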
Beyond MapReduce
Dataflow Engines
- Apache Spark
- Apache Flink
- Apache Tez
Advantages:
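- Intermediate state kept in memory or on local disk instead of the distributed filesystem
- DAG execution model: the whole workflow is one job of connected operators
- Better performance: no unnecessary map stages or sorting between operators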
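The DAG idea can be illustrated with a toy scheduler that runs operators in dependency order and passes intermediate results in memory. This is only a sketch: `run_dag` and the operator names are hypothetical, it runs on one machine, and it relies on `graphlib`, which is in the standard library from Python 3.9.

```python
from graphlib import TopologicalSorter

def run_dag(operators, dependencies):
    """Run each operator after its upstream operators, handing results over in memory."""
    results = {}
    for name in TopologicalSorter(dependencies).static_order():
        inputs = [results[upstream] for upstream in dependencies.get(name, ())]
        results[name] = operators[name](inputs)
    return results

# Hypothetical workflow: one source, one shared intermediate step, two outputs.
operators = {
    "read": lambda inputs: ["a b", "b c"],
    "split": lambda inputs: [word for line in inputs[0] for word in line.split()],
    "count": lambda inputs: len(inputs[0]),
    "distinct": lambda inputs: sorted(set(inputs[0])),
}
dependencies = {
    "read": [],
    "split": ["read"],
    "count": ["split"],
    "distinct": ["split"],
}
results = run_dag(operators, dependencies)
print(results["count"], results["distinct"])
```

The `split` step is computed once and feeds two downstream operators without being written out and re-read in between.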
High-Level APIs
- SQL interfaces
- Declarative queries
- Automatic optimization
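The declarative style can be shown with a single-machine stand-in. The sketch below uses Python's built-in `sqlite3` module purely for illustration; it is not a distributed engine, and the table name `page_views`, its columns, and the function `views_per_url` are made up for the example.

```python
import sqlite3

def views_per_url(rows):
    """Count page views per URL using a declarative query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE page_views (user_id INTEGER, url TEXT)")
        conn.executemany("INSERT INTO page_views VALUES (?, ?)", rows)
        # The query states what result is wanted; the engine decides how to compute it.
        query = """
            SELECT url, COUNT(*) AS views
            FROM page_views
            GROUP BY url
            ORDER BY views DESC, url
        """
        return conn.execute(query).fetchall()
    finally:
        conn.close()

print(views_per_url([(1, "/home"), (2, "/home"), (1, "/about")]))
```

Nothing in the query says how to group or sort, which is what leaves room for an optimizer to choose the execution strategy.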
Specialization
- Graph processing (Pregel, Giraph)
- Machine learning (Mahout, MLlib)
- Stream processing
The Output of Batch Workflows
Key-Value Stores
- Batch job produces key-value data
- Used for joins and lookups
Search Indexes
- Batch job builds search index
- Inverted index from documents
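Building an inverted index can be sketched in a few lines. The function names and sample documents are hypothetical, and tokenization is deliberately naive (lowercase, split on whitespace).

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map each term to the sorted list of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(doc_ids) for term, doc_ids in index.items()}

def search(index, *terms):
    """Return IDs of documents containing every given term."""
    postings = [set(index.get(term.lower(), ())) for term in terms]
    return sorted(set.intersection(*postings)) if postings else []

documents = {
    1: "batch processing with MapReduce",
    2: "stream processing",
    3: "MapReduce joins",
}
index = build_inverted_index(documents)
print(search(index, "processing"))
print(search(index, "mapreduce", "processing"))
```

In a batch setting the mappers would emit (term, document ID) pairs and the reducers would assemble each term's posting list.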
Materialized Views
- Pre-computed query results
- Updated periodically
Philosophy of Batch Process Outputs
Reprocessability
- Keep raw input data
- Re-run pipeline if needed
- Debugging and auditing
Deriving Several Views
- Multiple outputs from same input
- Event log → multiple derived datasets
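Deriving several views from one event log can be sketched as a set of pure functions over the same immutable input. The event fields and function names below are hypothetical.

```python
from collections import Counter

# Hypothetical raw event log; it is only read, never modified.
EVENTS = (
    {"user": "alice", "action": "view", "item": "book"},
    {"user": "bob", "action": "view", "item": "book"},
    {"user": "alice", "action": "purchase", "item": "book"},
    {"user": "bob", "action": "view", "item": "lamp"},
)

def views_per_item(events):
    return dict(Counter(e["item"] for e in events if e["action"] == "view"))

def purchases_per_user(events):
    return dict(Counter(e["user"] for e in events if e["action"] == "purchase"))

def derive_all(events):
    # Each view is a pure function of the log, so any of them can be
    # rebuilt from scratch after a bug fix without touching the input.
    return {
        "views_per_item": views_per_item(events),
        "purchases_per_user": purchases_per_user(events),
    }

print(derive_all(EVENTS))
```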
Key Takeaways
- MapReduce pioneered distributed batch processing
- Dataflow engines (Spark, Flink) run multi-stage workflows more efficiently than chained MapReduce jobs
- Joins are expensive but necessary for data integration
- Reprocessability is a key advantage of batch processing
- Derived data can be recomputed from source