Chapter 8: The Trouble with Distributed Systems
2 min readCore Concepts
Fundamental Challenges
- Network unreliability
- Clock synchronization issues
- Process pauses
- Partial failures
System Model
- Synchronous model: Every operation takes bounded time
- Asynchronous model: No guarantees on timing
- Partial synchronous model: Sometimes synchronous, sometimes not
Unreliable Networks
Common Failures
- Packet loss
- Network partitions
- Delays (queueing, congestion)
- Asymmetric communication
Timeout Problems
- How long to wait for response?
- Too short: False failure detection
- Too long: Slow failure detection
Unbounded Delays
- No guarantee on response time
- Queueing in network switches
- TCP retransmission
- Operating system scheduling
Unreliable Clocks
Problems with Clocks
- Clock skew between machines
- Clock jumps (NTP adjustments)
- Monotonic vs wall-clock time
Timestamps for Ordering
- Lamport timestamps: Capture causality
- Vector timestamps: Capture full causality
Clock Synchronization
- NTP: Network Time Protocol
- GPS: More accurate
- Atomic clocks: Most accurate
Limitations
- Clock readings have confidence intervals
- Cannot rely on clock for ordering
Process Pauses
The Problem
- Single-threaded server pauses for GC
- Other nodes think it's dead
- Resumes and may cause conflicts
Solutions
- Fencing tokens: Detect stale operations
- Leader and locks: Ensure only one leader
- Consensus algorithms: Handle process pauses
Truth and Lies
Fencing Tokens
- Monotonically increasing tokens
- Reject operations with old tokens
- Prevent stale writes
Byzantine Faults
- Nodes can lie or send incorrect data
- Requires Byzantine fault tolerance
- Used in blockchain, aerospace
CAP Theorem
- Consistency: All nodes see same data
- Availability: Every request gets response
- Partition tolerance: System works despite network failures
Theorem: Can only have 2 of 3 in presence of partitions
Trade-offs
- CP: Consistent but may be unavailable
- AP: Available but may be inconsistent
- CA: Not possible in distributed systems
Consistency and Consensus
Linearizability
- Strongest consistency model
- Operations appear atomic and instant
- Equivalent to single-copy serial execution
Ordering Guarantees
- Total order: All operations ordered
- Causal order: Operations ordered by causality
Key Takeaways
- Networks are unreliable - design for failures
- Clocks are unreliable - don't use them for ordering
- Process pauses are real - handle them explicitly
- CAP theorem forces trade-offs in distributed systems
- Linearizability is expensive but provides strong consistency