S3-like Object Storage

8 min read

Design object storage similar to Amazon S3. S3 launched 2006; 100+ trillion objects by 2021.

Storage System 101

	Block Storage	File Storage	Object Storage
Mutable	Yes	Yes	No (versioning, not in-place)
Cost	High	Med-High	Low
Performance	Med-Very High	Med-High	Low-Med
Consistency	Strong	Strong	Strong
Access	SAS/iSCSI/FC	CIFS/SMB, NFS	RESTful API
Scalability	Medium	High	Vast
Best for	VMs, DBs	General file access	Binary/unstructured data

Block: raw blocks as volumes, physically or network-attached (FC, iSCSI). Single-server ownership.
File: built on block storage, hierarchical directories, shared via SMB/NFS.
Object: flat structure, RESTful API, "cold" data, archival/backup. Sacrifices performance for durability + scale + low cost.

Terminology:

Bucket: logical container, globally unique name
Object: payload + metadata (name-value pairs)
Versioning: multiple variants of object in bucket
URI: uniquely identifies resources
SLA: e.g. S3 Standard-IA: 99.999999999% durability, 99.9% availability

Step 1 - Understand the Problem and Establish Design Scope

Requirements:

Bucket creation, object upload/download, versioning, listing objects in bucket
Both massive objects (few GBs+) and many small objects (tens of KBs)
100 PB data per year
Durability: 6 nines (99.9999%), Availability: 4 nines (99.99%)
Storage efficiency (reduce costs while maintaining reliability/performance)

Back-of-the-envelope estimation:

Object distribution: 20% small (<1MB), 60% medium (1-64MB), 20% large (>64MB)
HDD: 100-150 IOPS (7200 rpm SATA)
Using median sizes (0.5MB, 32MB, 200MB), 40% storage usage:
- 100 PB → 10¹¹ MB
- ~0.68 billion objects
- Metadata ~1KB each → 0.68 TB metadata

Step 2 - Propose High-Level Design and Get Buy-In

Key properties:

Object immutability: no incremental changes; replace with new version
Key-value store: URI = key, object data = value
Write once, read many: 95% of requests are reads (LinkedIn)
Support both small and large objects

Analogy: Like UNIX file system — inode (metadata) separate from file data (disk blocks). Object storage: metadata store (object metadata) + data store (object data), connected via object UUID over network.

Figure 2: Unix file system and object store

Figure 3: Bucket & object

High-level design

Figure 4: High-level design

Load balancer: distributes RESTful API requests
API service: stateless, orchestrates IAM + metadata + data store calls
IAM: authentication + authorization
Data store: stores/retrieves object data by UUID
Metadata store: stores object metadata

Uploading an object

Figure 5: Uploading an object

Client: PUT to create bucket → API service
API → IAM: verify WRITE permission
API → metadata store: create bucket entry
Client: PUT to create object script.txt
API → IAM: verify WRITE on bucket
API → data store: persist payload → returns UUID
API → metadata store: create entry (object_name, object_id/UUID, bucket_id)

Sample upload request:

PUT /bucket-to-share/script.txt HTTP/1.1
Host: foo.s3example.org
Authorization: authorization string
Content-Type: text/plain
Content-Length: 4567
x-amz-meta-author: Alex
[4567 bytes of object data]

Downloading an object

Figure 6: Downloading an object

Client: GET /bucket-to-share/script.txt
API → IAM: verify READ access
API → metadata store: fetch UUID by object name
API → data store: fetch data by UUID
Return object data to client

Logical hierarchy via naming: bucket-to-share/script.txt implies folder structure (no actual directories).

Step 3 - Design Deep Dive

Data store

Figure 7: Upload and download an object

Data store components

Figure 8: Data store components

Data routing service: stateless RESTful/gRPC service. Queries placement service for best data node; reads/writes data to nodes.

Placement service: determines which data nodes (primary + replicas) store an object. Maintains virtual cluster map (physical topology). Ensures replicas are physically separated for durability. Monitors nodes via heartbeats (15s grace period). Critical → 5-7 nodes with Paxos/Raft consensus (tolerates minority failure).

Figure 9: Virtual cluster map

Data node: stores actual object data. Replicates to multiple nodes (replication group). Daemon sends heartbeats (how many drives, how much data). On first heartbeat, placement assigns: unique ID, virtual cluster map, replication targets.

Data persistence flow

Figure 10: Data persistence flow

API → data store: forward object data
Data routing: generate UUID, query placement → get primary data node (via consistent hashing)
Send data to primary + UUID
Primary saves locally → replicates to 2 secondary nodes → responds only after all replicas succeed
Return UUID to API service

Consistency vs latency trade-off:

Figure 11: Consistency vs latency

All 3 nodes: best consistency, highest latency
Primary + 1 secondary: medium consistency/latency (eventual)
Primary only: worst consistency, lowest latency (eventual)

How data is organized

Storing each object as a separate file fails at scale:

Wastes disk blocks (4KB block for tiny file)
Exhausts inode capacity
OS handles millions of inodes poorly

Solution: Merge small objects into larger files (WAL-like). Append to read-write file. At capacity (~few GBs) → mark read-only, create new read-write file. Serialize writes to read-write file. For multi-core servers: dedicated read-write file per core.

Figure 12: Store multiple small objects in one big file

Object lookup

To locate object by UUID: need data file name, start offset, object size.

Field	Description
object_id	UUID
file_name	Name of containing file
start_offset	Beginning address in file
object_size	Number of bytes

Database choice: Read-heavy (write once, read many). RocksDB (SSTable-based, fast writes) vs. Relational (B+ tree, fast reads). Relational wins (better read perf). Deploy SQLite per data node (local, file-based, isolated per node).

Tables 3-4: Object_mapping table

Updated persistence flow

Figure 13: Updated data persistence flow

API sends object 4
Data node appends to /data/c
Insert record into object_mapping table
Return UUID

Durability

Hardware failure & failure domains

Annual disk failure rate: ~0.81%. 3 copies → 1-(0.0081³) ≈ 99.9999% reliability
Failure domains: node-level (server components), rack-level (shared switches/power), AZ-level (separate infrastructure)
Replicate data across AZs for extreme-case resilience

Figure 14: Multi-Datacenter replication

Erasure coding

Alternative to replication. (4+2) example: data split into 4 equal chunks (d1-d4), 2 parities (p1, p2) computed mathematically. Lose d3, d4 → reconstruct from d1, d2, p1, p2.

Figure 15: Erasure coding

(8+4): 8 data chunks + 4 parities across 12 failure domains. Can lose up to 4 nodes and reconstruct.

Figure 16: (8+4) erasure coding

Storage overhead: replication 3× → 200%. Erasure coding (8+4) → 50%.

Figure 17: Extra space comparison

Durability from 0.81% AFR: erasure coding → 11 nines.

	Replication	Erasure Coding
Durability	6 nines (3 copies)	11 nines (8+4)
Storage efficiency	200% overhead	50% overhead
Compute	None	Parity calculation
Write perf	Faster (no calculation)	Slower (parity calc)
Read perf	Single replica	Multiple nodes; slower under failure

Replication: latency-sensitive apps. Erasure coding: cost efficiency, archival. This design focuses on replication.

Correctness verification (checksums)

In-memory data corruption common at scale. Append checksum (MD5) to each object + file-level checksum when marked read-only.

Figures 18-20: Checksum generation, comparison, data node layout

Read flow: fetch data + checksum → compute → mismatched? → try other failure domains → reconstruct from 8 valid pieces.

Metadata data model

Schema

Figure 21: Database tables

Two tables: bucket (small, ~10GB for 1M users × 10 buckets × 1KB) and object.

Scale bucket table: single server fits. Spread reads via replicas.

Scale object table: shard by hash(bucket_name, object_name) — most operations are URI-based.

Three queries supported:

Find object ID by name
Insert/delete object by name
List objects in bucket sharing prefix

Listing objects in a bucket

Flat structure, organized by prefixes. Path: s3://mybucket/abc/d/e/f/file.txt → bucket: mybucket, object: abc/d/e/f/file.txt, prefix: abc/d/e/f/.

Three uses:

aws s3 list-buckets — list all user's buckets
aws s3 ls s3://mybucket/abc/ — list at same prefix level (rolls up deeper prefixes)
aws s3 ls s3://mybucket/abc/ --recursive — all objects sharing prefix

Single DB: SELECT * FROM object WHERE bucket_id = "123" AND object_name LIKE 'abc/%'

Sharded DB (distributed): query all shards → aggregate. Paging is complex: each shard returns different counts, offsets vary per shard, cursor must track all offsets. Solution: denormalize listing data into a separate table sharded by bucket_id → isolates listing to single DB. Acceptable trade-off (listing performance is rarely priority).

Object versioning

Figure 22: Object versioning

Upload flow for versioned object: same as normal, but metadata store inserts new row (not overwrite) with same bucket_id + object_name, new object_id (UUID) + object_version (TIMEUUID). Current version = largest TIMEUUID.

Figure 23: Versioned metadata

Deletion: all versions remain; insert a delete marker as new version. GET on delete marker → 404.

Figure 24: Delete with delete marker

Optimizing uploads of large files — Multipart upload

Figure 25: Multipart upload

Initiate multipart upload → get uploadID
Split file into parts (e.g. 1.6GB → 8 × 200MB). Upload each part with uploadID.
Each part returns ETag (MD5 checksum)
Complete multipart upload: send uploadID + part numbers + ETags
Data store reassembles object (may take minutes for large files)

Old parts after reassembly → garbage collected.

Garbage collection

Reclaims space from: lazy deletion (marked deleted), orphan data (half-uploaded/abandoned multipart), corrupted data (checksum failures).

Compaction: Copy non-deleted objects from read-only files to new file. Skip objects with delete_flag = true. Update object_mapping table (file_name, start_offset). Wrap updates in transaction.

Figure 26: Compaction

Wait until many read-only files accumulate to compact → avoid creating many small files.

Step 4 - Wrap Up

Designed S3-like object storage: storage taxonomy → high-level design (data store + metadata store separation) → deep dive into data persistence (WAL-like merging, SQLite per node), durability (replication vs. erasure coding, checksums), metadata sharding (hash of bucket+object name), listing (denormalized table), versioning (TIMEUUID + delete markers), multipart upload, garbage collection (compaction).

Storage System 101

Step 1 - Understand the Problem and Establish Design Scope

Step 2 - Propose High-Level Design and Get Buy-In

High-level design #

Uploading an object #

Downloading an object #

Step 3 - Design Deep Dive

Data store #

Data store components #

Data persistence flow #

How data is organized #

Object lookup #

Updated persistence flow #

Durability #

Hardware failure & failure domains #

Erasure coding #

Correctness verification (checksums) #

Metadata data model #

Schema #

Listing objects in a bucket #

Object versioning #

Optimizing uploads of large files — Multipart upload #

Garbage collection #