Chapter 4: Encoding and Evolution
Core Concepts
The Challenge
- Application data structures change over time
- Need to handle schema evolution
- Data flows through multiple systems
Modes of Dataflow
- Through Databases: Writer encodes, reader decodes
- Through Services: Client encodes the request and server decodes it; server encodes the response and client decodes it
- Through Asynchronous Message Passing: Producer encodes, consumer decodes
Formats for Encoding Data
Language-Specific Formats
Examples: Java Serializable, Python pickle, Ruby Marshal
Problems:
- Tied to specific language
- Security vulnerabilities: decoding untrusted data can lead to arbitrary code execution
- Performance overhead
- Poor cross-language support
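The security problem is easiest to see with a harmless payload. The sketch below uses Python's pickle; the class name Innocent and the expression "6 * 7" are made up for illustration. It shows that the bytes being decoded, not the receiving program, decide which callable runs.

```python
import pickle


class Innocent:
    """Hypothetical class; __reduce__ tells pickle how to rebuild the object."""

    def __reduce__(self):
        # pickle will call eval("6 * 7") while decoding. A real attacker
        # would substitute something far less harmless.
        return (eval, ("6 * 7",))


payload = pickle.dumps(Innocent())

# The receiver only calls pickle.loads, yet it ends up evaluating the
# expression chosen by the payload and gets that result back instead of
# an Innocent instance.
result = pickle.loads(payload)
print(result)
```

This is why pickle data should only be loaded from sources that are fully trusted.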
JSON, XML, and Binary Variants
JSON:
- Human-readable
- Widely supported
- No native support for binary strings (byte sequences); binary data is usually Base64-encoded
- Schema-on-read
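A minimal sketch of the binary-string limitation, using only Python's standard json and base64 modules. The field name "blob" is an arbitrary choice for the example.

```python
import base64
import json

raw = bytes([0x00, 0xFF, 0x10, 0x80])  # arbitrary bytes, not valid UTF-8 text

# The json module has no representation for bytes, so this fails.
try:
    json.dumps({"blob": raw})
    direct_ok = True
except TypeError:
    direct_ok = False

# Common workaround: carry the bytes as Base64 text. Reader and writer must
# agree on this out of band, because JSON itself does not record it.
encoded = json.dumps({"blob": base64.b64encode(raw).decode("ascii")})
decoded = base64.b64decode(json.loads(encoded)["blob"])

print(direct_ok, decoded == raw)
```

The workaround costs space: Base64 turns every 3 bytes of input into 4 characters of output.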
XML:
- Verbose
- Schema validation (XSD)
- Namespace support
Binary Variants (MessagePack):
- More compact than JSON
- Still schema-on-read
- Can be faster to parse, but the space saving is often modest because field names are still stored in every record
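To make the size comparison concrete, here is a hand-written encoder for a very small subset of MessagePack (small maps, short strings, integers from 0 to 127). It is a sketch for illustration, not a replacement for a real MessagePack library, and the sample record is invented.

```python
import json


def pack(value):
    """Encode a tiny subset of MessagePack: fixmap, fixstr, positive fixint."""
    if isinstance(value, dict) and len(value) <= 15:
        out = bytes([0x80 | len(value)])  # fixmap header: 1000xxxx
        for key, item in value.items():
            out += pack(key) + pack(item)
        return out
    if isinstance(value, str) and len(value.encode("utf-8")) <= 31:
        raw = value.encode("utf-8")
        return bytes([0xA0 | len(raw)]) + raw  # fixstr header: 101xxxxx
    if isinstance(value, int) and not isinstance(value, bool) and 0 <= value <= 127:
        return bytes([value])  # positive fixint: 0xxxxxxx
    raise ValueError(f"outside the subset handled by this sketch: {value!r}")


person = {"firstName": "Ada", "age": 36}
packed = pack(person)
as_json = json.dumps(person, separators=(",", ":")).encode("utf-8")

print(len(packed), len(as_json))
print(b"firstName" in packed)  # field names still travel with every record
```

For this record the binary form takes 20 bytes against 28 for compact JSON. Most of what remains is the field names, which is the part a schema-based format can drop.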
Thrift and Protocol Buffers
Concept: Schema-based binary encoding
Thrift:
struct Person {
  1: string firstName,
  2: string lastName,
  3: i32 age
}
Protocol Buffers:
syntax = "proto3";

message Person {
  string first_name = 1;
  string last_name = 2;
  int32 age = 3;
}
Key Features:
- Field tags (numbers) for versioning
- Forward/backward compatibility
- Code generation for many languages
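The role of field tags can be sketched with a simplified version of the Protocol Buffers wire format. The code below handles only non-negative integers (wire type 0, varint) and strings (wire type 2, length-delimited). It is hand-written for illustration rather than generated by protoc, and the email field with tag 4 is a hypothetical later addition to the schema.

```python
def encode_varint(n):
    """Base-128 varint: 7 bits per byte, high bit set on all but the last."""
    out = bytearray()
    while n > 0x7F:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)


def decode_varint(data, pos):
    result = shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_field(number, value):
    if isinstance(value, int):
        return encode_varint((number << 3) | 0) + encode_varint(value)
    raw = value.encode("utf-8")
    return encode_varint((number << 3) | 2) + encode_varint(len(raw)) + raw


def decode_message(data, names):
    """Decode, keeping only the tags listed in `names` (tag -> field name)."""
    fields, pos = {}, 0
    while pos < len(data):
        key, pos = decode_varint(data, pos)
        number, wire_type = key >> 3, key & 0x07
        if wire_type == 0:
            value, pos = decode_varint(data, pos)
        elif wire_type == 2:
            length, pos = decode_varint(data, pos)
            value = data[pos:pos + length].decode("utf-8")
            pos += length
        else:
            raise ValueError(f"wire type {wire_type} not handled in this sketch")
        if number in names:  # unknown tags are skipped, not treated as errors
            fields[names[number]] = value
    return fields


# Written by newer code that knows a hypothetical fourth field.
new_message = (
    encode_field(1, "Ada")
    + encode_field(2, "Lovelace")
    + encode_field(3, 36)
    + encode_field(4, "ada@example.com")
)

# Read by older code that only knows tags 1 to 3.
old_names = {1: "first_name", 2: "last_name", 3: "age"}
old_view = decode_message(new_message, old_names)
print(old_view)
```

Because each field carries its tag and enough type information to find its length, the old reader can skip tag 4 without understanding it. That is also why a tag number cannot be changed or reused once data has been written with it.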
Avro
Concept: Schema-based binary encoding, started as a subproject of Hadoop
{
  "type": "record",
  "name": "Person",
  "fields": [
    {"name": "firstName", "type": "string"},
    {"name": "lastName", "type": "string"},
    {"name": "age", "type": "int"}
  ]
}
Key Features:
- No field tags in binary encoding
- Schema required to decode
- Writer's schema and reader's schema
- Schema resolution for evolution
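The absence of field tags can be illustrated with a hand-written sketch of Avro's binary encoding for the Person record above, covering only the string and int types. It follows the rules in the Avro specification (zig-zag varints for integers, a length prefix for strings, fields concatenated in schema order) but is not the official avro library, and the sample values are invented.

```python
def write_long(n):
    """Zig-zag encode, then write as a varint (used for ints and lengths)."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)


def read_long(data, pos):
    z = shift = 0
    while True:
        byte = data[pos]
        pos += 1
        z |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return (z >> 1) ^ -(z & 1), pos
        shift += 7


PERSON_SCHEMA = {
    "type": "record",
    "name": "Person",
    "fields": [
        {"name": "firstName", "type": "string"},
        {"name": "lastName", "type": "string"},
        {"name": "age", "type": "int"},
    ],
}


def encode(schema, record):
    out = bytearray()
    for field in schema["fields"]:  # values are written in schema order
        value = record[field["name"]]
        if field["type"] == "string":
            raw = value.encode("utf-8")
            out += write_long(len(raw)) + raw
        elif field["type"] == "int":
            out += write_long(value)
        else:
            raise ValueError(f"type {field['type']!r} not handled in this sketch")
    return bytes(out)


def decode(schema, data):
    record, pos = {}, 0
    for field in schema["fields"]:  # the schema is the only guide to the bytes
        if field["type"] == "string":
            length, pos = read_long(data, pos)
            record[field["name"]] = data[pos:pos + length].decode("utf-8")
            pos += length
        elif field["type"] == "int":
            record[field["name"]], pos = read_long(data, pos)
        else:
            raise ValueError(f"type {field['type']!r} not handled in this sketch")
    return record


person = {"firstName": "Ada", "lastName": "Lovelace", "age": 36}
encoded = encode(PERSON_SCHEMA, person)
print(len(encoded), b"firstName" in encoded)
print(decode(PERSON_SCHEMA, encoded) == person)
```

The encoded record contains neither field names nor tags, only the values. A reader that walks the bytes with a different field order or different types would silently produce garbage, which is why the exact writer's schema is needed to decode.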
The Merits of Schemas
Benefits:
- Documentation
- Schema validation
- Code generation
- Compact encoding
- Schema evolution
Modes of Dataflow Revisited
Dataflow Through Databases
- Writer encodes data → database stores it → reader decodes
- Challenge: Old data may use different schema
- Solution: Schema evolution (backward/forward compatibility)
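A small sketch of old and new data living side by side, using Python's built-in sqlite3 module with an in-memory database. The table and column names are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (first_name TEXT, last_name TEXT)")

# Row written by the old version of the application.
conn.execute("INSERT INTO person VALUES ('Ada', 'Lovelace')")

# Schema change: existing rows are not rewritten, they just read back
# as NULL for the new column.
conn.execute("ALTER TABLE person ADD COLUMN age INTEGER")

# Row written by the new version of the application.
conn.execute("INSERT INTO person VALUES ('Alan', 'Turing', 41)")

rows = conn.execute(
    "SELECT first_name, age FROM person ORDER BY first_name"
).fetchall()
print(rows)  # [('Ada', None), ('Alan', 41)]
conn.close()
```

New code reading this table has to cope with the missing value in the old row, which is the backward-compatibility requirement in practice.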
Dataflow Through Services (REST/RPC)
- REST: Resource-based, HTTP methods
- RPC: Remote procedure call abstraction
Challenges:
- Versioning
- Backward/forward compatibility
- Performance overhead
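One way to keep a service compatible across versions is to treat newly added request fields as optional and to ignore fields the server does not recognise. The handler below is a hypothetical sketch: the function name, the field names, and the response shape are all assumptions made for illustration.

```python
import json


def handle_create_person(body: str) -> str:
    """Hypothetical handler for a 'create person' request."""
    request = json.loads(body)
    person = {
        "firstName": request["firstName"],
        "lastName": request["lastName"],
        # "age" is assumed to have been added in a later API version, so it
        # is optional: requests from older clients simply omit it.
        "age": request.get("age"),
    }
    # Fields the server does not know about are ignored rather than
    # rejected, so newer clients can still talk to an older server.
    return json.dumps({"status": "created", "person": person})


old_client = json.dumps({"firstName": "Ada", "lastName": "Lovelace"})
new_client = json.dumps(
    {"firstName": "Ada", "lastName": "Lovelace", "age": 36, "nickname": "AAL"}
)

old_reply = json.loads(handle_create_person(old_client))
new_reply = json.loads(handle_create_person(new_client))
print(old_reply["person"]["age"], new_reply["person"]["age"])
```

Whether unknown fields should be ignored or rejected is a design choice. Ignoring them favours compatibility, while rejecting them catches client mistakes earlier.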
Dataflow Through Message Passing
- Producer sends encoded message
- Consumer decodes and processes
- Asynchronous communication
Challenges:
- Schema compatibility
- Message format evolution
- Ordering guarantees
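Because producers and consumers are deployed independently, a consumer may see messages written with several schema versions on the same queue. One common approach is to prefix each message with a schema version that the consumer looks up before decoding. The sketch below assumes a hypothetical in-process registry (SCHEMAS) and uses queue.Queue as a stand-in for a real message broker.

```python
import json
import queue

# Hypothetical registry: schema version -> field names in encoding order.
SCHEMAS = {
    1: ["firstName", "lastName"],
    2: ["firstName", "lastName", "age"],
}


def encode_message(version, record):
    """One version byte, then the field values as a JSON array."""
    values = [record[name] for name in SCHEMAS[version]]
    return bytes([version]) + json.dumps(values).encode("utf-8")


def decode_message(message):
    """Look up the writer's schema by version, then rebuild the record."""
    version = message[0]
    values = json.loads(message[1:].decode("utf-8"))
    return dict(zip(SCHEMAS[version], values))


broker = queue.Queue()  # stands in for a message broker

# An old producer and a new producer publish to the same queue.
broker.put(encode_message(1, {"firstName": "Ada", "lastName": "Lovelace"}))
broker.put(encode_message(2, {"firstName": "Alan", "lastName": "Turing", "age": 41}))

received = []
while not broker.empty():
    received.append(decode_message(broker.get()))
print(received)
```

The consumer never has to guess which layout a message uses, at the cost of requiring every version that is still in flight to stay available in the registry.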
Schema Evolution Strategies
Backward Compatibility
- New code can read old data
- Reader handles both old and new formats
Forward Compatibility
- Old code can read new data
- New fields ignored by old readers
Full Compatibility
- Both backward and forward
- Not required by schema registries, but one of the compatibility modes a registry can typically be configured to enforce
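The three definitions can be turned into a small checker for record schemas. This is a simplified sketch: it only looks at added and removed fields and their defaults, ignores type changes, and the function names and sample schemas are invented for illustration.

```python
def can_read(reader, writer):
    """True if every reader field is either written or has a default."""
    writer_names = {field["name"] for field in writer["fields"]}
    return all(
        field["name"] in writer_names or "default" in field
        for field in reader["fields"]
    )


def compatibility(old, new):
    backward = can_read(reader=new, writer=old)  # new code reads old data
    forward = can_read(reader=old, writer=new)   # old code reads new data
    if backward and forward:
        return "full"
    if backward:
        return "backward"
    if forward:
        return "forward"
    return "none"


V1 = {"fields": [{"name": "firstName", "type": "string"}]}
V2_WITH_DEFAULT = {
    "fields": [
        {"name": "firstName", "type": "string"},
        {"name": "age", "type": "int", "default": 0},
    ]
}
V2_NO_DEFAULT = {
    "fields": [
        {"name": "firstName", "type": "string"},
        {"name": "age", "type": "int"},
    ]
}

print(compatibility(V1, V2_WITH_DEFAULT))  # full
print(compatibility(V1, V2_NO_DEFAULT))    # forward
print(compatibility(V2_NO_DEFAULT, V1))    # backward
```

Adding a field without a default breaks backward compatibility, because new code has nothing to fill in when it reads old data. Removing such a field breaks forward compatibility for the mirror-image reason.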
Avro's Approach
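- Writer's schema travels with the data: embedded once in the header of an object container file, referenced by a version number stored with each database record, or negotiated when a network connection is set up
- Reader's schema resolved against writer's, matching fields by name
- Only fields with a default value can be added or removed without breaking compatibility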
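Schema resolution can be sketched on records that have already been decoded with the writer's schema. The function below is a simplification of what an Avro library does (no type promotion, no aliases), and the email field is a hypothetical addition to the reader's schema.

```python
def resolve(written, reader_schema):
    """Translate a record decoded with the writer's schema into the reader's view."""
    resolved = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in written:
            resolved[name] = written[name]      # matched by field name
        elif "default" in field:
            resolved[name] = field["default"]   # missing in writer: use default
        else:
            raise ValueError(f"no value and no default for field {name!r}")
    return resolved  # writer fields the reader does not declare are dropped


# Record as decoded with the writer's schema (firstName, lastName, age).
written = {"firstName": "Ada", "lastName": "Lovelace", "age": 36}

# Reader's schema: no longer has age, and adds a hypothetical email field.
reader_schema = {
    "type": "record",
    "name": "Person",
    "fields": [
        {"name": "firstName", "type": "string"},
        {"name": "lastName", "type": "string"},
        {"name": "email", "type": ["null", "string"], "default": None},
    ],
}

resolved = resolve(written, reader_schema)
print(resolved)
```

Here age is ignored because the reader does not declare it, and email is filled from its default because the writer never wrote it. Without the default, the same read would fail.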
Key Takeaways
- Avoid language-specific formats for cross-system communication
- Schema-based formats (Thrift, Protobuf, Avro) are better for evolution
- Backward and forward compatibility are essential for evolving systems
- Data flows through multiple systems, so consider every reader and writer
- Schema registries help manage compatibility in microservices