Chapter 3: Storage and Retrieval

7 min read

Core Concepts

Database Fundamentals

A database needs to do two things:

Store data when you give it
Retrieve data when you ask for it

Why Storage Engine Knowledge Matters

You probably won't implement your own storage engine
But you need to select the right one for your application
Need to understand internals to tune performance

Data Structures That Power Your Database

Simple Key-Value Store (Bash Example)

db_set() {
    echo "$1,$2" >> database
}

db_get() {
    grep "^$1," database | sed -e "s/^$1,//" | tail -n 1
}

How it works:

Append-only log file
Each call to db_set appends to file
db_get scans entire file for key (O(n) - terrible performance!)

Hash Indexes

Concept: Keep in-memory hash map mapping keys to byte offsets

In-memory hash table:
"key1" → offset 0
"key2" → offset 25
"key3" → offset 50

Bitcask (Riak's default storage engine):

All keys must fit in RAM
Values can be larger than RAM (loaded from disk)
High-performance for write-heavy workloads

Segment Files and Compaction:

Break log into segments of fixed size
Compaction: Remove duplicate keys, keep only latest
Merging: Combine multiple segments

Advantages of Append-Only:

Sequential writes (much faster than random)
Concurrency and crash recovery simpler
Merging avoids fragmentation

Limitations:

Hash table must fit in memory
Range queries not efficient (can't scan key ranges)

SSTables and LSM-Trees

Sorted String Tables (SSTables)

Key Change: Require key-value pairs to be sorted by key

Advantages over Hash Indexes:

Efficient Merging (like mergesort):
- Read input files side by side
- Copy lowest key to output
- O(n) merging even for large files
Sparse In-Memory Index:
- Don't need all keys in memory
- Just offsets for some keys
- Can scan between known offsets
Compression:
- Group records into blocks
- Compress before writing to disk
- Reduce I/O bandwidth

Constructing SSTables

In-Memory Balanced Tree (Red-Black, AVL):

Write goes to in-memory memtable
When memtable exceeds threshold → flush to disk as SSTable
Reads: Check memtable → newest SSTable → older SSTables
Background compaction merges segments

Crash Recovery:

Write-ahead log (WAL) for durability
Every write appended to log before memtable
Log discarded when memtable flushed to SSTable

LSM-Tree (Log-Structured Merge-Tree)

Used by: LevelDB, RocksDB, Cassandra, HBase

Components:

Memtable: In-memory sorted structure
SSTables: On-disk sorted files
WAL: For crash recovery
Compaction: Background merging

graph TD
    W[Write arrives] --> M[Memtable<br/>in-memory sorted]
    W --> L[Write-Ahead Log<br/>for crash recovery]
    M -- "threshold reached" --> F[Flush to disk<br/>new SSTable]
    F --> S[(SSTable files<br/>on disk)]
    S --> C[Background<br/>compaction & merge]
    C --> S
    R[Read request] --> M
    R --> B[Bloom filter<br/>check SSTables]
    B --> S

Performance Optimizations:

Bloom filters: Skip SSTables that don't contain a key
Size-tiered compaction: Merge smaller SSTables into larger ones
Leveled compaction: Split key ranges into levels

B-Trees

Concept

Most widely used indexing structure
Used in almost all relational databases
Keep key-value pairs sorted by key

Structure

Fixed-size blocks/pages (typically 4KB)
One page designated as root
Pages contain keys and references to child pages
Branching factor: Number of child references per page

Lookup

Start at root page
Follow references based on key ranges
Eventually reach leaf page with actual data
Most databases: 3-4 levels deep

graph TD
    R["Root Page<br/>[200 | 400 | 600]"] -->|"key < 200"| I1["Internal Page<br/>[50 | 100 | 150]"]
    R -->|"200 ≤ key < 400"| I2["Internal Page<br/>[250 | 300 | 350]"]
    R -->|"400 ≤ key < 600"| I3["Internal Page<br/>[450 | 500 | 550]"]
    R -->|"key ≥ 600"| I4["Internal Page<br/>[650 | 700 | 750]"]
    I1 --> L1["Leaf Page<br/>rows with keys 0-200"]
    I2 --> L2["Leaf Page<br/>rows with keys 200-400"]
    I3 --> L3["Leaf Page<br/>rows with keys 400-600"]
    I4 --> L4["Leaf Page<br/>rows with keys 600+"]

Updating B-Trees

Insert:

Find appropriate leaf page
Add key-value pair
If page full → split into two
Update parent page

Write-Ahead Log (WAL):

Every modification written to WAL first
Used for crash recovery
Protects against corrupted index

B-Tree Optimizations

Copy-on-write: Write modified page to new location
Abbreviated keys: Store partial keys in interior pages
Sibling pointers: Link leaf pages for efficient scanning
Fractal trees: Borrow log-structured ideas

Comparing B-Trees and LSM-Trees

LSM-Tree Advantages

Aspect	LSM-Tree	B-Tree
Write amplification	Lower	Higher
Write throughput	Higher	Lower
Storage efficiency	Better compression	Fragmentation
Sequential I/O	Yes	No

LSM-Tree Disadvantages

Aspect	LSM-Tree	B-Tree
Read performance	Slower (check multiple structures)	Faster
Compaction overhead	Can interfere with queries	None
Write bandwidth	Shared between writes and compaction	Dedicated to writes
Transaction isolation	Multiple copies of key	Single location

Write Amplification

LSM-tree: Data rewritten during compaction
B-tree: Data written at least twice (WAL + page)
Concerning for SSDs (limited write cycles)

When to Choose What

graph TD
    Q["What's your workload?"] --> W{"Write-heavy?"}
    W -->|Yes| LSM["LSM-Tree<br/>✓ High write throughput<br/>✓ Better compression<br/>✓ Sequential I/O"]
    W -->|No| R{"Need predictable<br/>read latency?"}
    R -->|Yes| BT["B-Tree<br/>✓ Fast reads<br/>✓ Predictable latency<br/>✓ Strong transactions"]
    R -->|No| LSM

Choose LSM-Tree when:

Write-heavy workload
Sequential writes preferred
Storage efficiency matters

Choose B-Tree when:

Read-heavy workload
Predictable latency required
Strong transactional semantics needed

Other Indexing Structures

Secondary Indexes

Not unique (many rows can have same key)
Can be built from key-value indexes
Used for efficient joins

Storing Values

Heap File:

Index stores reference to row location
Row stored in no particular order
Avoids data duplication

Clustered Index:

Row data stored directly in index
Faster reads (no extra hop)
Used by MySQL InnoDB (primary key)

Covering Index:

Stores some columns in index
Query answered using index alone
Trade-off: faster reads vs more storage

Multi-Column Indexes

Concatenated Index:

Combine multiple fields into one key
Like old-fashioned phone book (lastname, firstname)
Useful for queries on leading columns

Multi-Dimensional Indexes:

For geospatial data, weather data
B-tree/LSM-tree can't efficiently query multiple dimensions
Use R-trees or space-filling curves

Full-Text Search

Fuzzy querying (typo tolerance)
Lucene uses SSTable-like structure
In-memory index as finite state automaton
Levenshtein automaton for edit distance

In-Memory Databases

Keep entire dataset in RAM
Durability via WAL or snapshots
Products: VoltDB, MemSQL, Redis, Memcached
Very fast (no disk I/O for reads)

Transaction Processing or Analytics?

OLTP (Online Transaction Processing)

User-facing applications
Many queries, each touching few records
Read and write via index
High throughput of short transactions

OLAP (Online Analytical Processing)

Analytics, reporting
Few queries, each touching many records
Read-only, full table scans
Low throughput of heavy queries

Data Warehousing

Copy of OLTP data for analytics
Schema optimized for queries (star/snowflake)
ETL process to load data

Star and Snowflake Schemas

Star Schema:

Central fact table (events, transactions)
Dimension tables (products, customers, time)
Simple joins

Snowflake Schema:

Dimension tables further normalized
More complex joins
Saves space

Column-Oriented Storage

Concept

Store each column separately
Much better for analytics queries

Benefits

Compression:

Similar values in same column → excellent compression
Bitmap indexes for low-cardinality columns

Vectorized Processing:

Process multiple values at once
Better CPU cache utilization

Column Families (Bigtable/Cassandra/HBase)

Group related columns together
Compromise between row-oriented and column-oriented

Writing to Column-Oriented Storage

Use LSM-tree style approach
Buffer writes in memory
Flush to disk in sorted segments

Key Takeaways

Indexing trade-off: Speed up reads, slow down writes
LSM-trees: Better for write-heavy workloads
B-trees: Better for read-heavy workloads
Column-oriented storage: Much better for analytics
OLTP vs OLAP: Different access patterns need different optimizations
Data locality: Grouping related data improves performance

Core Concepts #

Database Fundamentals #

Why Storage Engine Knowledge Matters #

Data Structures That Power Your Database #

Simple Key-Value Store (Bash Example) #

Hash Indexes #

SSTables and LSM-Trees #

Sorted String Tables (SSTables) #

Constructing SSTables #

LSM-Tree (Log-Structured Merge-Tree) #

B-Trees #

Concept #

Structure #

Lookup #

Updating B-Trees #

B-Tree Optimizations #

Comparing B-Trees and LSM-Trees #

LSM-Tree Advantages #

LSM-Tree Disadvantages #

Write Amplification #

When to Choose What #

Other Indexing Structures #

Secondary Indexes #

Storing Values #

Multi-Column Indexes #

Full-Text Search #

In-Memory Databases #

Transaction Processing or Analytics? #

OLTP (Online Transaction Processing) #

OLAP (Online Analytical Processing) #

Data Warehousing #

Star and Snowflake Schemas #

Column-Oriented Storage #

Concept #

Benefits #

Column Families (Bigtable/Cassandra/HBase) #

Writing to Column-Oriented Storage #

Key Takeaways #

Core Concepts

Database Fundamentals

Why Storage Engine Knowledge Matters

Data Structures That Power Your Database

Simple Key-Value Store (Bash Example)

Hash Indexes

SSTables and LSM-Trees

Sorted String Tables (SSTables)

Constructing SSTables

LSM-Tree (Log-Structured Merge-Tree)

B-Trees

Concept

Structure

Lookup

Updating B-Trees

B-Tree Optimizations

Comparing B-Trees and LSM-Trees

LSM-Tree Advantages

LSM-Tree Disadvantages

Write Amplification

When to Choose What

Other Indexing Structures

Secondary Indexes

Storing Values

Multi-Column Indexes

Full-Text Search

In-Memory Databases

Transaction Processing or Analytics?

OLTP (Online Transaction Processing)

OLAP (Online Analytical Processing)

Data Warehousing

Star and Snowflake Schemas

Column-Oriented Storage

Concept

Benefits

Column Families (Bigtable/Cassandra/HBase)

Writing to Column-Oriented Storage

Key Takeaways