Chapter 1: Reliable, Scalable, and Maintainable Applications
Core Concepts
Data-Intensive vs Compute-Intensive
- Data-intensive: The limiting factor is the amount of data, its complexity, and the speed at which it changes, rather than raw CPU power
- Compute-intensive: The limiting factor is CPU cycles
- Most modern applications are data-intensive
- Common building blocks: databases, caches, search indexes, stream processing, batch processing
Thinking About Data Systems
- Boundaries between categories are blurring (e.g., Redis as database + message queue, Kafka as message queue + database)
- Applications increasingly combine multiple tools stitched together with application code
- When you combine tools, you become a data system designer
Three Fundamental Properties
1. Reliability
Definition: The system continues to work correctly even when faults occur.
Types of Faults:
| Fault Type | Characteristics | Example |
|---|---|---|
| Hardware | Random, independent | Disk crash, power outage |
| Software | Systematic, correlated | Bug triggered by specific input, cascading failures |
| Human | Unpredictable | Configuration errors, operator mistakes |
Key Distinction: Fault ≠ Failure
- Fault: One component deviating from spec
- Failure: System as a whole stops providing required service
Fault Tolerance Techniques:
- Hardware redundancy (RAID, dual power supplies)
- Software fault tolerance (replication, retry logic)
- Deliberate fault injection (Netflix Chaos Monkey)
- Process isolation, monitoring, rollback capabilities
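The retry logic mentioned above can be sketched as follows. This is a minimal illustration, not a production library: `call_with_retries` and `FlakyService` are hypothetical names, and the backoff parameters are arbitrary assumptions. It shows how a fault in one component (a failed call) can be absorbed before it becomes a failure visible to the caller.

```python
import random
import time


def call_with_retries(operation, max_attempts=3, base_delay=0.1, sleep=time.sleep):
    """Run `operation`, retrying on exceptions with exponential backoff and jitter.

    A single failed call (a fault) is absorbed here; the exception reaches the
    caller (a failure) only if every attempt fails.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: the fault surfaces as a failure
            # Wait base_delay, 2*base_delay, 4*base_delay, ... plus random jitter
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay))


class FlakyService:
    """Hypothetical component that fails a fixed number of times, then succeeds."""

    def __init__(self, failures_before_success):
        self.remaining_failures = failures_before_success
        self.calls = 0

    def fetch(self):
        self.calls += 1
        if self.remaining_failures > 0:
            self.remaining_failures -= 1
            raise ConnectionError("transient fault")
        return "ok"


if __name__ == "__main__":
    service = FlakyService(failures_before_success=2)
    print(call_with_retries(service.fetch), "after", service.calls, "calls")
```

Retrying is only safe when the operation is idempotent or otherwise protected against duplicate execution; the sketch assumes that it is.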
Human Error Mitigation:
- Design systems to minimize error opportunities
- Decouple the places where people make the most mistakes from the places where mistakes can cause failures
- Provide sandbox environments for experimentation
- Thorough testing at all levels
- Quick recovery mechanisms
- Detailed monitoring and telemetry
2. Scalability
Definition: The system's ability to cope with increased load.
Describing Load:
- Load parameters: Metrics that characterize your system's load, for example:
  - Requests per second
  - Read/write ratio
  - Number of active users
  - Cache hit rate
  - Data volume
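As a rough illustration, several of these load parameters can be derived from raw counters collected over a measurement window. The `Counters` fields and the `load_parameters` function below are hypothetical names chosen for this sketch, not part of any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class Counters:
    """Hypothetical raw counters collected over one measurement window."""
    window_seconds: float
    reads: int
    writes: int
    cache_hits: int
    cache_misses: int


def load_parameters(c: Counters) -> dict:
    """Derive a few load parameters from the raw counters."""
    requests = c.reads + c.writes
    lookups = c.cache_hits + c.cache_misses
    return {
        "requests_per_second": requests / c.window_seconds,
        # Reads per write; infinite when the window saw no writes at all
        "read_write_ratio": c.reads / c.writes if c.writes else float("inf"),
        # Fraction of cache lookups that were served from the cache
        "cache_hit_rate": c.cache_hits / lookups if lookups else 0.0,
    }


if __name__ == "__main__":
    window = Counters(window_seconds=60, reads=5400, writes=600,
                      cache_hits=4500, cache_misses=900)
    print(load_parameters(window))
```

Which parameters matter most depends on the architecture: a read-heavy system with a high cache hit rate behaves very differently under growth than a write-heavy one.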
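Twitter Example (November 2012 data):
- Post tweet: 4.6k requests/sec on average (over 12k at peak)
- Home timeline: 300k reads/sec
- Fan-out challenge: the average tweet is delivered to about 75 followers, so 4.6k tweets/sec become roughly 345k writes/sec to home timeline caches
- Celebrity problem: some users have over 30 million followers, so a single tweet can trigger over 30 million timeline writes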
Approaches:
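- Simple relational: Query all followed users' tweets at read time (cheap writes, expensive reads)
- Fan-out on write: Pre-compute timelines by writing each new tweet to every follower's cache (expensive writes, cheap reads); this pays off because timeline reads are far more frequent than tweet posts
- Hybrid: Fan-out on write for normal users; fetch celebrities' tweets separately at read time and merge them into the timeline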
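The three approaches can be compared in a toy in-memory model. This is a sketch under simplifying assumptions, not Twitter's implementation: the `Timelines` class, its method names, and the fixed `celebrities` set are all hypothetical, a sequence counter stands in for timestamps, and follows are assumed to happen before the tweets they should receive.

```python
from collections import defaultdict


class Timelines:
    """Toy model of home timelines; every name here is illustrative."""

    def __init__(self, celebrities=()):
        self.followers = defaultdict(set)    # author -> users who follow them
        self.following = defaultdict(set)    # user -> authors they follow
        self.tweets = defaultdict(list)      # author -> [(seq, author, text)]
        self.home_cache = defaultdict(list)  # user -> precomputed timeline entries
        self.celebrities = set(celebrities)  # authors excluded from fan-out
        self.seq = 0                         # stand-in for a timestamp

    def follow(self, user, author):
        self.followers[author].add(user)
        self.following[user].add(author)

    def post(self, author, text):
        self.seq += 1
        tweet = (self.seq, author, text)
        self.tweets[author].append(tweet)
        # Fan-out on write: one cache write per follower, skipped for celebrities
        if author not in self.celebrities:
            for user in self.followers[author]:
                self.home_cache[user].append(tweet)

    def timeline_on_read(self, user):
        """Simple approach: gather every followed author's tweets at read time."""
        merged = [t for a in self.following[user] for t in self.tweets[a]]
        return sorted(merged, reverse=True)  # newest first

    def timeline_hybrid(self, user):
        """Hybrid: cached entries plus celebrity tweets fetched at read time."""
        celebrity_tweets = [t for a in self.following[user]
                            if a in self.celebrities
                            for t in self.tweets[a]]
        return sorted(self.home_cache[user] + celebrity_tweets, reverse=True)


if __name__ == "__main__":
    t = Timelines(celebrities={"star"})
    t.follow("alice", "bob")
    t.follow("alice", "star")
    t.post("bob", "hi")
    t.post("star", "hello")
    print(t.timeline_hybrid("alice"))
```

With an empty `celebrities` set, `timeline_hybrid` reduces to pure fan-out on write: every read is a single cache lookup, and all the work happens in `post`.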
Describing Performance:
| Metric | Description | Use Case |
|---|---|---|
| Throughput | Records processed per second | Batch processing |
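| Response time | Time between a client sending a request and receiving the response, including service time, network delays, and queueing delays | Online systems |
| Latency | Time a request spends waiting to be handled (not including service time) | Detailed analysis of queueing |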
Response Time Analysis:
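- Use percentiles, not just averages: the mean does not tell you how many users actually experienced a given delay
- p50 (median): Half of requests are faster than this value and half are slower
- p95, p99, p99.9: Tail latencies, the thresholds that 95%, 99%, and 99.9% of requests stay under; critical for user experience
- Tail latency amplification: When serving one user request requires multiple backend calls, the slowest call determines the overall response time, so even a small fraction of slow backend calls affects a much larger fraction of user requests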
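The points above can be made concrete with a small sketch. It uses the nearest-rank definition of a percentile (one of several common definitions) and, for tail latency amplification, assumes that backend calls are slow independently of each other. Both function names are illustrative.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def chance_of_slow_request(backend_calls, slow_fraction):
    """Probability that at least one of `backend_calls` calls is slow,
    assuming each call is slow independently with probability `slow_fraction`."""
    return 1 - (1 - slow_fraction) ** backend_calls


if __name__ == "__main__":
    response_times_ms = [12, 15, 14, 13, 16, 15, 14, 13, 900, 14]
    print("mean:", sum(response_times_ms) / len(response_times_ms))
    print("p50: ", percentile(response_times_ms, 50))
    print("p95: ", percentile(response_times_ms, 95))
    print("slow:", round(chance_of_slow_request(100, 0.01), 3))
```

In this sample a single 900 ms outlier pulls the mean to 102.6 ms even though the median is 14 ms, which is why the mean says little about the typical user. Under the independence assumption, a request that fans out to 100 backend calls, each slow 1% of the time, is slow about 63% of the time.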
Coping with Load:
- Vertical scaling (scale up): More powerful machine
- Horizontal scaling (scale out): Distribute load across machines
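- Elastic scaling: Automatically add computing resources when a load increase is detected; useful when load is highly unpredictable
- Manual scaling: A human analyzes capacity and decides when to add machines; simpler, with fewer operational surprises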
Key Insight: There is no generic "magic scaling sauce" - architecture must be specific to the application's load patterns.
3. Maintainability
Definition: Making life easier for engineering and operations teams.
Three Design Principles:
Operability
Make it easy for operations teams to keep the system running:
- Good monitoring and visibility
- Support for automation
- Avoid dependency on individual machines
- Good documentation and operational model
- Predictable behavior
Simplicity
Make it easy for new engineers to understand the system:
- Remove accidental complexity (complexity that arises from the implementation rather than from the problem itself)
- Use abstraction to hide implementation details
- Good abstractions enable reuse and higher quality
Evolvability
Make it easy for engineers to make changes:
- Adapt to changing requirements
- Apply Agile practices at the level of the whole data system, not only to small, local code changes
- Related to simplicity and good abstractions
Key Takeaways
- Reliability, scalability, and maintainability are interdependent - optimizing one may affect others
- Fault tolerance is about handling specific types of faults, not all possible faults
- Scalability is not a binary property - it's about having strategies for coping with growth
- Percentiles matter more than averages for understanding user experience
- Good abstractions are key to managing complexity and enabling evolvability