Chapter 11: Stream Processing
2 min readCore Concepts
Batch vs Stream Processing
| Aspect | Batch | Stream |
|---|---|---|
| Input | Bounded dataset | Unbounded stream |
| Output | New dataset | Updates to output |
| Latency | High (minutes/hours) | Low (milliseconds/seconds) |
| Use case | ETL, analytics | Real-time monitoring, alerts |
Messaging Systems
Message Brokers:
- Store messages temporarily
- Producer-consumer pattern
- Examples: RabbitMQ, ActiveMQ
Log-Based Messaging:
- Append-only log
- Kafka, Kinesis
- Durable, replayable
Log-Based Messaging
How It Works
- Producer appends message to log
- Consumer reads from log
- Messages persisted until consumed
- Multiple consumers can read same log
Advantages
- Durability: Messages survive crashes
- Replayability: Re-process old messages
- Scalability: Partition log across machines
Consumer Groups
- Group of consumers share partition
- Each partition consumed by one consumer
- Rebalancing when consumers join/leave
Change Data Capture (CDC)
Concept
- Capture changes from database
- Stream changes to other systems
- Keep derived data in sync
Implementations
Trigger-Based:
- Database triggers capture changes
- Write to change log
- Simple but has overhead
Log-Based:
- Read database's write-ahead log
- No overhead on database
- More complex implementation
Log Compaction
- Remove old versions of keys
- Keep only latest value
- Reduce storage, maintain current state
Event Sourcing
Concept
- Store all events, not just current state
- Current state derived from event log
- Immutable event history
Commands vs Events
- Command: Request to do something
- Event: Record that something happened
Event Sourcing Benefits
- Complete audit trail
- Time-travel debugging
- Derived views from same events
Stream Processing
Stream Tables
- Continuous query on event stream
- Results updated in real-time
- Join stream with table (enrichment)
Stream-Stream Join
- Join two event streams
- Time-windowed (events within time range)
- Complex state management
Window Operations
Tumbling Window: Fixed-size, non-overlapping Hopping Window: Fixed-size, overlapping Sliding Window: Time-based, continuous
Exactly-Once Processing
- Each event processed exactly once
- Challenging in distributed systems
- Checkpointing and transactional processing
Stream Analytics
Time-Window Aggregation
- Count events per time window
- Compute averages, sums
- Real-time dashboards
Pattern Detection
- Complex event processing (CEP)
- Detect sequences of events
- Fraud detection, monitoring
Fault Tolerance
Checkpointing
- Periodically save state
- Restart from checkpoint on failure
- Used by Flink, Spark Streaming
Microbatching
- Process small batches of events
- Checkpoint between batches
- Used by Spark Streaming
Transactional Writes
- Atomic updates to output
- Exactly-once semantics
- Achieved via transactions
Key Takeaways
- Log-based messaging (Kafka) is the foundation for stream processing
- Change Data Capture keeps derived data in sync with databases
- Event Sourcing provides complete history and auditability
- Windowing handles unbounded streams
- Exactly-once processing is achievable but complex
- Stream and batch processing are converging (Lambda/Kappa architectures)