Chapter 3: Storage and Retrieval
6 min readCore Concepts
Database Fundamentals
A database needs to do two things:
- Store data when you give it
- Retrieve data when you ask for it
Why Storage Engine Knowledge Matters
- You probably won't implement your own storage engine
- But you need to select the right one for your application
- Need to understand internals to tune performance
Data Structures That Power Your Database
Simple Key-Value Store (Bash Example)
db_set() {
echo "$1,$2" >> database
}
db_get() {
grep "^$1," database | sed -e "s/^$1,//" | tail -n 1
}
How it works:
- Append-only log file
- Each call to
db_setappends to file db_getscans entire file for key (O(n) - terrible performance!)
Hash Indexes
Concept: Keep in-memory hash map mapping keys to byte offsets
In-memory hash table:
"key1" → offset 0
"key2" → offset 25
"key3" → offset 50
Bitcask (Riak's default storage engine):
- All keys must fit in RAM
- Values can be larger than RAM (loaded from disk)
- High-performance for write-heavy workloads
Segment Files and Compaction:
- Break log into segments of fixed size
- Compaction: Remove duplicate keys, keep only latest
- Merging: Combine multiple segments
Advantages of Append-Only:
- Sequential writes (much faster than random)
- Concurrency and crash recovery simpler
- Merging avoids fragmentation
Limitations:
- Hash table must fit in memory
- Range queries not efficient (can't scan key ranges)
SSTables and LSM-Trees
Sorted String Tables (SSTables)
Key Change: Require key-value pairs to be sorted by key
Advantages over Hash Indexes:
-
Efficient Merging (like mergesort):
- Read input files side by side
- Copy lowest key to output
- O(n) merging even for large files
-
Sparse In-Memory Index:
- Don't need all keys in memory
- Just offsets for some keys
- Can scan between known offsets
-
Compression:
- Group records into blocks
- Compress before writing to disk
- Reduce I/O bandwidth
Constructing SSTables
In-Memory Balanced Tree (Red-Black, AVL):
- Write goes to in-memory memtable
- When memtable exceeds threshold → flush to disk as SSTable
- Reads: Check memtable → newest SSTable → older SSTables
- Background compaction merges segments
Crash Recovery:
- Write-ahead log (WAL) for durability
- Every write appended to log before memtable
- Log discarded when memtable flushed to SSTable
LSM-Tree (Log-Structured Merge-Tree)
Used by: LevelDB, RocksDB, Cassandra, HBase
Components:
- Memtable: In-memory sorted structure
- SSTables: On-disk sorted files
- WAL: For crash recovery
- Compaction: Background merging
Performance Optimizations:
- Bloom filters: Skip SSTables that don't contain a key
- Size-tiered compaction: Merge smaller SSTables into larger ones
- Leveled compaction: Split key ranges into levels
B-Trees
Concept
- Most widely used indexing structure
- Used in almost all relational databases
- Keep key-value pairs sorted by key
Structure
- Fixed-size blocks/pages (typically 4KB)
- One page designated as root
- Pages contain keys and references to child pages
- Branching factor: Number of child references per page
Lookup
- Start at root page
- Follow references based on key ranges
- Eventually reach leaf page with actual data
- Most databases: 3-4 levels deep
Example
Root: [200 | 400 | 600]
↓ ↓ ↓
[100-200] [200-400] [400-600]
↓ ↓ ↓
Leaf pages with actual data
Updating B-Trees
Insert:
- Find appropriate leaf page
- Add key-value pair
- If page full → split into two
- Update parent page
Write-Ahead Log (WAL):
- Every modification written to WAL first
- Used for crash recovery
- Protects against corrupted index
B-Tree Optimizations
- Copy-on-write: Write modified page to new location
- Abbreviated keys: Store partial keys in interior pages
- Sibling pointers: Link leaf pages for efficient scanning
- Fractal trees: Borrow log-structured ideas
Comparing B-Trees and LSM-Trees
LSM-Tree Advantages
| Aspect | LSM-Tree | B-Tree |
|---|---|---|
| Write amplification | Lower | Higher |
| Write throughput | Higher | Lower |
| Storage efficiency | Better compression | Fragmentation |
| Sequential I/O | Yes | No |
LSM-Tree Disadvantages
| Aspect | LSM-Tree | B-Tree |
|---|---|---|
| Read performance | Slower (check multiple structures) | Faster |
| Compaction overhead | Can interfere with queries | None |
| Write bandwidth | Shared between writes and compaction | Dedicated to writes |
| Transaction isolation | Multiple copies of key | Single location |
Write Amplification
- LSM-tree: Data rewritten during compaction
- B-tree: Data written at least twice (WAL + page)
- Concerning for SSDs (limited write cycles)
When to Choose What
Choose LSM-Tree when:
- Write-heavy workload
- Sequential writes preferred
- Storage efficiency matters
Choose B-Tree when:
- Read-heavy workload
- Predictable latency required
- Strong transactional semantics needed
Other Indexing Structures
Secondary Indexes
- Not unique (many rows can have same key)
- Can be built from key-value indexes
- Used for efficient joins
Storing Values
Heap File:
- Index stores reference to row location
- Row stored in no particular order
- Avoids data duplication
Clustered Index:
- Row data stored directly in index
- Faster reads (no extra hop)
- Used by MySQL InnoDB (primary key)
Covering Index:
- Stores some columns in index
- Query answered using index alone
- Trade-off: faster reads vs more storage
Multi-Column Indexes
Concatenated Index:
- Combine multiple fields into one key
- Like old-fashioned phone book (lastname, firstname)
- Useful for queries on leading columns
Multi-Dimensional Indexes:
- For geospatial data, weather data
- B-tree/LSM-tree can't efficiently query multiple dimensions
- Use R-trees or space-filling curves
Full-Text Search
- Fuzzy querying (typo tolerance)
- Lucene uses SSTable-like structure
- In-memory index as finite state automaton
- Levenshtein automaton for edit distance
In-Memory Databases
- Keep entire dataset in RAM
- Durability via WAL or snapshots
- Products: VoltDB, MemSQL, Redis, Memcached
- Very fast (no disk I/O for reads)
Transaction Processing or Analytics?
OLTP (Online Transaction Processing)
- User-facing applications
- Many queries, each touching few records
- Read and write via index
- High throughput of short transactions
OLAP (Online Analytical Processing)
- Analytics, reporting
- Few queries, each touching many records
- Read-only, full table scans
- Low throughput of heavy queries
Data Warehousing
- Copy of OLTP data for analytics
- Schema optimized for queries (star/snowflake)
- ETL process to load data
Star and Snowflake Schemas
Star Schema:
- Central fact table (events, transactions)
- Dimension tables (products, customers, time)
- Simple joins
Snowflake Schema:
- Dimension tables further normalized
- More complex joins
- Saves space
Column-Oriented Storage
Concept
- Store each column separately
- Much better for analytics queries
Benefits
Compression:
- Similar values in same column → excellent compression
- Bitmap indexes for low-cardinality columns
Vectorized Processing:
- Process multiple values at once
- Better CPU cache utilization
Column Families (Bigtable/Cassandra/HBase)
- Group related columns together
- Compromise between row-oriented and column-oriented
Writing to Column-Oriented Storage
- Use LSM-tree style approach
- Buffer writes in memory
- Flush to disk in sorted segments
Key Takeaways
- Indexing trade-off: Speed up reads, slow down writes
- LSM-trees: Better for write-heavy workloads
- B-trees: Better for read-heavy workloads
- Column-oriented storage: Much better for analytics
- OLTP vs OLAP: Different access patterns need different optimizations
- Data locality: Grouping related data improves performance