Chapter 1: Reliable, Scalable, and Maintainable Applications

Core Concepts

Data-Intensive vs Compute-Intensive

Data-intensive: The limiting factor is the amount of data, its complexity, and the speed at which it changes
Compute-intensive: The limiting factor is CPU power
Most modern applications are data-intensive
Common building blocks: databases, caches, search indexes, stream processing, batch processing

Thinking About Data Systems

Boundaries between categories are blurring (e.g., Redis is a datastore that is also used as a message queue; Kafka is a message queue with database-like durability guarantees)
Applications increasingly combine multiple tools stitched together with application code
When you combine tools, you become a data system designer

Three Fundamental Properties

1. Reliability

Definition: The system continues to work correctly even when faults occur.

Types of Faults:

Fault Type	Characteristics	Example
Hardware	Random, independent	Disk crash, power outage
Software	Systematic, correlated	Bug triggered by specific input, cascading failures
Human	Unpredictable	Configuration errors, operator mistakes

Key Distinction: Fault ≠ Failure

Fault: One component deviating from spec
Failure: System as a whole stops providing required service

Fault Tolerance Techniques:

Hardware redundancy (RAID, dual power supplies)
Software fault tolerance (replication, retry logic)
Deliberate fault injection (Netflix Chaos Monkey)
Process isolation, monitoring, rollback capabilities
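
As an illustration of the "retry logic" item above, here is a minimal sketch of retrying with exponential backoff. The names TransientError, call_with_retry, and make_flaky are hypothetical and exist only for this example; the attempt count and delays are arbitrary assumptions, not recommendations from the chapter.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for faults that are worth retrying."""


def call_with_retry(operation, max_attempts=4, base_delay=0.1, sleep=time.sleep):
    """Run operation(), retrying transient faults with exponential backoff.

    A fault in one attempt only becomes a failure for the caller
    if every attempt fails.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: the fault surfaces as a failure
            # The base delay doubles on each attempt; random jitter keeps
            # many clients from retrying at the same instant.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay))


def make_flaky(fail_times):
    """Return an operation that fails fail_times times, then succeeds."""
    state = {"calls": 0}

    def operation():
        state["calls"] += 1
        if state["calls"] <= fail_times:
            raise TransientError("simulated fault")
        return "ok"

    return operation


if __name__ == "__main__":
    # Two simulated faults are absorbed; the caller only sees the result.
    print(call_with_retry(make_flaky(2)))
```

Retrying only helps with faults that are transient. A deterministic bug triggered by a specific input fails on every attempt, which is why the sketch re-raises once the attempts run out.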

Human Error Mitigation:

Design systems to minimize error opportunities
Decouple the places where people make the most mistakes from the places where mistakes can cause failures
Provide sandbox environments for experimentation
Thorough testing at all levels
Quick, easy recovery (fast rollback of configuration changes, gradual rollout of new code)
Detailed monitoring and telemetry

2. Scalability

Definition: The system's ability to cope with increased load.

Describing Load:

Load parameters: Metrics that characterize your system's load
- Requests per second
- Read/write ratio
- Number of active users
- Cache hit rate
- Data volume

Twitter Example (2012 data):

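Tweets posted: 4.6k requests/sec on average (over 12k/sec at peak)
Home timeline reads: 300k requests/sec
Fan-out challenge: each tweet is delivered to about 75 followers on average, so timeline writes far outnumber tweets posted
Celebrity problem: some users have over 30 million followers, so a single tweet can trigger over 30 million timeline writes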
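
To make the fan-out numbers concrete, the arithmetic can be sketched as below. The constants are the figures quoted above; the function name is hypothetical, and applying the 75-follower average to the peak rate is an assumption made only for illustration.

```python
# Figures quoted above (Twitter, 2012).
TWEETS_PER_SEC_AVG = 4_600
TWEETS_PER_SEC_PEAK = 12_000
AVG_FOLLOWERS_PER_TWEET = 75
CELEBRITY_FOLLOWERS = 30_000_000


def timeline_writes_per_sec(tweets_per_sec, followers_per_tweet):
    """Cache writes generated if every tweet is copied to each follower's timeline."""
    return tweets_per_sec * followers_per_tweet


avg_writes = timeline_writes_per_sec(TWEETS_PER_SEC_AVG, AVG_FOLLOWERS_PER_TWEET)
# Assumption: the 75-follower average also holds at the peak posting rate.
peak_writes = timeline_writes_per_sec(TWEETS_PER_SEC_PEAK, AVG_FOLLOWERS_PER_TWEET)
# One tweet from a 30M-follower account costs one write per follower.
celebrity_tweet_writes = timeline_writes_per_sec(1, CELEBRITY_FOLLOWERS)

print(f"average: {avg_writes:,} timeline writes/sec")          # average: 345,000 timeline writes/sec
print(f"peak:    {peak_writes:,} timeline writes/sec")         # peak:    900,000 timeline writes/sec
print(f"one celebrity tweet: {celebrity_tweet_writes:,} writes")  # one celebrity tweet: 30,000,000 writes
```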

Approaches:

Simple relational: Query all followed users' tweets at read time
Fan-out on write: Pre-compute timelines, write to each follower's cache
Hybrid: Fan out on write for most users; fetch celebrities' tweets at read time and merge them into the timeline
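
The hybrid approach above can be sketched as a toy in-memory model. This is an illustration under stated assumptions, not Twitter's implementation: the class TimelineService, its methods, and the CELEBRITY_THRESHOLD cutoff are all hypothetical, and the threshold is kept tiny so the example is easy to follow.

```python
from collections import defaultdict

CELEBRITY_THRESHOLD = 3  # hypothetical cutoff; real systems would use a far larger one


class TimelineService:
    """Toy model: fan-out on write for normal users, read-time fetch for celebrities."""

    def __init__(self):
        self.followers = defaultdict(set)        # author -> users following them
        self.following = defaultdict(set)        # user -> authors they follow
        self.tweets = defaultdict(list)          # author -> [(seq, text)]
        self.timeline_cache = defaultdict(list)  # user -> [(seq, author, text)]
        self._seq = 0                            # global ordering of tweets

    def follow(self, user, author):
        self.followers[author].add(user)
        self.following[user].add(author)

    def is_celebrity(self, author):
        return len(self.followers[author]) >= CELEBRITY_THRESHOLD

    def post(self, author, text):
        self._seq += 1
        self.tweets[author].append((self._seq, text))
        if not self.is_celebrity(author):
            # Fan-out on write: one cache write per follower.
            for user in self.followers[author]:
                self.timeline_cache[user].append((self._seq, author, text))

    def home_timeline(self, user):
        # Start from the precomputed cache (cheap read).
        merged = {seq: (author, text) for seq, author, text in self.timeline_cache[user]}
        # Merge in celebrity tweets at read time; keying by seq avoids duplicates
        # for authors who crossed the threshold after some tweets were fanned out.
        for author in self.following[user]:
            if self.is_celebrity(author):
                for seq, text in self.tweets[author]:
                    merged[seq] = (author, text)
        return [merged[seq] for seq in sorted(merged, reverse=True)]


if __name__ == "__main__":
    svc = TimelineService()
    svc.follow("alice", "bob")
    svc.post("bob", "hi")            # fanned out to alice's cache
    for u in ("alice", "u1", "u2"):
        svc.follow(u, "star")
    svc.post("star", "hello")        # celebrity: stored once, no fan-out
    print(svc.home_timeline("alice"))
```

The design trade-off is where the work happens: fan-out on write makes reads cheap at the cost of many writes per tweet, while the read-time merge keeps a celebrity tweet to a single write at the cost of extra work per timeline read.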

Describing Performance:

Metric	Description	Use Case
Throughput	Records processed per second	Batch processing
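Response time	Time between a client sending a request and receiving the response (service time plus network and queueing delays)	Online systems
Latency	Time a request spends waiting to be handled (not including service time)	Isolating queueing delay from processing time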

Response Time Analysis:

Use percentiles, not just averages
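p50 (median): Half of requests are faster than this value, half are slower
p95, p99, p99.9: Thresholds that 95%, 99%, and 99.9% of requests are faster than. These tail latencies are critical for user experience
Tail latency amplification: When serving one user request requires several backend calls, a single slow call delays the whole response, so the share of slow user requests grows with the number of calls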
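
A small sketch of both ideas above: computing percentiles from measured response times, and estimating tail latency amplification. The sample data is invented for illustration, the nearest-rank method is one of several percentile definitions, and prob_any_slow assumes backend calls are independent, which real systems often violate.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples <= it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def prob_any_slow(backend_calls, per_call_percentile=99.0):
    """Chance that at least one of N independent calls is slower than the given percentile."""
    fast = per_call_percentile / 100
    return 1 - fast ** backend_calls


# Invented response times in milliseconds: nine typical requests and one outlier.
response_times_ms = [12, 15, 14, 13, 16, 11, 15, 14, 13, 950]

mean_ms = sum(response_times_ms) / len(response_times_ms)
p50_ms = percentile(response_times_ms, 50)
p99_ms = percentile(response_times_ms, 99)

if __name__ == "__main__":
    print(f"mean={mean_ms} ms, p50={p50_ms} ms, p99={p99_ms} ms")
    for n in (1, 10, 100):
        print(f"{n:>3} backend calls -> {prob_any_slow(n):.1%} of requests hit a p99-slow call")
```

With this data the mean (107.3 ms) describes no real request: the typical user sees about 14 ms and the unlucky one sees 950 ms. Under the independence assumption, a request that fans out to 100 backend calls has roughly a 63% chance of hitting at least one call slower than that backend's p99.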

Coping with Load:

Vertical scaling (scale up): More powerful machine
Horizontal scaling (scale out): Distribute load across machines
Elastic scaling: The system adds computing resources automatically when it detects a load increase
Manual scaling: A human analyzes capacity and decides when to add machines

Key Insight: There is no generic "magic scaling sauce" - architecture must be specific to the application's load patterns.

3. Maintainability

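Definition: The system is easy for engineering and operations teams to work on productively, both when keeping current behavior running and when adapting it to new use cases.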

Three Design Principles:

Operability

Make it easy for operations teams to keep the system running:

Good monitoring and visibility
Support for automation
Avoid dependency on individual machines
Good documentation and operational model
Predictable behavior

Simplicity

Make it easy for new engineers to understand the system:

Remove accidental complexity (complexity that arises from the implementation, not from the problem being solved)
Use abstraction to hide implementation details
Good abstractions enable reuse and higher quality

Evolvability

Make it easy for engineers to make changes:

Adapt to changing requirements
Apply Agile-style adaptability at the level of the whole data system, not just small local code changes
Related to simplicity and good abstractions

Key Takeaways

Reliability, scalability, and maintainability are interdependent - optimizing one may affect others
Fault tolerance is about handling specific types of faults, not all possible faults
Scalability is not a binary property - it's about having strategies for coping with growth
Percentiles matter more than averages for understanding user experience
Good abstractions are key to managing complexity and enabling evolvability
