Chapter 8: The Trouble with Distributed Systems

2 min read

Core Concepts

Fundamental Challenges

Network unreliability
Clock synchronization issues
Process pauses
Partial failures

System Model

Synchronous model: Every operation takes bounded time
Asynchronous model: No guarantees on timing
Partial synchronous model: Sometimes synchronous, sometimes not

Unreliable Networks

Common Failures

Packet loss
Network partitions
Delays (queueing, congestion)
Asymmetric communication

Timeout Problems

How long to wait for response?
Too short: False failure detection
Too long: Slow failure detection

Unbounded Delays

No guarantee on response time
Queueing in network switches
TCP retransmission
Operating system scheduling

Unreliable Clocks

Problems with Clocks

Clock skew between machines
Clock jumps (NTP adjustments)
Monotonic vs wall-clock time

Timestamps for Ordering

Lamport timestamps: Capture causality
Vector timestamps: Capture full causality

Clock Synchronization

NTP: Network Time Protocol
GPS: More accurate
Atomic clocks: Most accurate

Limitations

Clock readings have confidence intervals
Cannot rely on clock for ordering

Process Pauses

The Problem

Single-threaded server pauses for GC
Other nodes think it's dead
Resumes and may cause conflicts

Solutions

Fencing tokens: Detect stale operations
Leader and locks: Ensure only one leader
Consensus algorithms: Handle process pauses

Truth and Lies

Fencing Tokens

Monotonically increasing tokens
Reject operations with old tokens
Prevent stale writes

Byzantine Faults

Nodes can lie or send incorrect data
Requires Byzantine fault tolerance
Used in blockchain, aerospace

CAP Theorem

Consistency: All nodes see same data
Availability: Every request gets response
Partition tolerance: System works despite network failures

Theorem: Can only have 2 of 3 in presence of partitions

Trade-offs

CP: Consistent but may be unavailable
AP: Available but may be inconsistent
CA: Not possible in distributed systems

Consistency and Consensus

Linearizability

Strongest consistency model
Operations appear atomic and instant
Equivalent to single-copy serial execution

Ordering Guarantees

Total order: All operations ordered
Causal order: Operations ordered by causality

Key Takeaways

Networks are unreliable - design for failures
Clocks are unreliable - don't use them for ordering
Process pauses are real - handle them explicitly
CAP theorem forces trade-offs in distributed systems
Linearizability is expensive but provides strong consistency