Chapter 2: Data Models and Query Languages
5 min readCore Concepts
Data Models Matter
- Data models have profound effect on how software is written AND how we think about problems
- Most applications are built by layering data models on top of each other
Layered Data Representation:
- Application developer models real world → objects/data structures
- Data structures → general-purpose data model (JSON, relational tables, graphs)
- Database engineers → bytes in memory/disk/network
- Hardware engineers → electrical currents, magnetic fields
Relational Model vs Document Model
Relational Model (SQL)
- Proposed by Edgar Codd in 1970
- Data organized into relations (tables) of tuples (rows)
- Dominated for 25-30 years
- Roots in business data processing (transaction processing, batch processing)
Key Advantage: Hid implementation details behind clean interface
Document Model (NoSQL)
Driving Forces Behind NoSQL:
- Need for greater scalability (large datasets, high write throughput)
- Preference for free/open source software
- Specialized query operations not well supported by relational model
- Frustration with restrictive schemas
Example - LinkedIn Profile:
Relational approach:
-- Multiple tables with foreign keys
users (user_id, first_name, last_name, ...)
positions (user_id, job_title, organization, ...)
education (user_id, school_name, start, end, ...)
contact_info (user_id, blog, twitter, ...)
Document approach (JSON):
{
"user_id": 251,
"first_name": "Bill",
"last_name": "Gates",
"positions": [...],
"education": [...],
"contact_info": {...}
}
Object-Relational Mismatch
- Impedance mismatch: Disconnect between OOP objects and relational tables
- ORM frameworks (ActiveRecord, Hibernate) help but don't eliminate the mismatch
- JSON document model can feel more natural for document-like data
Many-to-One and Many-to-Many Relationships
Normalization vs Denormalization:
| Aspect | Normalized (IDs) | Denormalized (Strings) |
|---|---|---|
| Storage | Less duplication | More duplication |
| Updates | Update in one place | Update everywhere |
| Consistency | Easier to maintain | Risk of inconsistency |
| Query performance | Requires joins | Single query |
Document Database Limitations:
- Weak join support
- Many-to-many relationships require multiple queries
- Data becomes more interconnected over time
Historical Context: Document DBs Repeating History
Hierarchical Model (IMS, 1968):
- Similar to JSON/document model
- Tree of records nested within records
- Good for one-to-many, poor for many-to-many
Network Model (CODASYL):
- Generalization of hierarchical model
- Records could have multiple parents
- Required manual "access path" navigation
- Code became complex and inflexible
Relational Model Advantage:
- Laid out all data openly
- Query optimizer automatically decides access paths
- Easy to add new features without changing queries
Schema Flexibility
Schema-on-Write (Relational):
- Schema is explicit and enforced
- Requires migrations for changes
- Similar to static typing
Schema-on-Read (Document):
- Structure is implicit, interpreted when read
- More flexible for changing data formats
- Similar to dynamic typing
When Schema-on-Read is Better:
- Heterogeneous data (many different object types)
- Data structure determined by external systems
- Rapidly evolving requirements
Data Locality
Document Model Advantage:
- Entire document stored as single continuous string
- Better locality for queries that access whole document
- Single query retrieves all relevant data
Document Model Limitations:
- Must load entire document even for small portions
- Updates may require rewriting entire document
- Keep documents small for best performance
Convergence
Modern databases are becoming more similar:
- Relational databases adding JSON support (PostgreSQL, MySQL, IBM DB2)
- Document databases adding relational-like joins (RethinkDB)
- Hybrid approach: Best of both worlds
Query Languages
Imperative vs Declarative
Imperative (e.g., JavaScript):
function getSharks(animals) {
var sharks = [];
for (var i = 0; i < animals.length; i++) {
if (animals[i].family === "Sharks") {
sharks.push(animals[i]);
}
}
return sharks;
}
Declarative (SQL/Relational Algebra):
SELECT * FROM animals WHERE family = 'Sharks';
Advantages of Declarative:
- More concise and easier to work with
- Hides implementation details → database can optimize
- Better for parallel execution
- Database can introduce improvements without changing queries
Declarative on the Web
CSS is declarative:
li.selected > p { background-color: blue; }
XSL/XPath is declarative:
<xsl:template match="li[@class='selected']/p">
<fo:block background-color="blue">
JavaScript DOM manipulation is imperative (verbose, error-prone)
Graph-Like Data Models
When to Use Graphs
- Data has many-to-many relationships
- Relationships between data items are complex
- Examples: Social networks, knowledge graphs, road networks
Property Graph Model (Neo4j)
Vertices have:
- Unique identifier
- Set of outgoing edges
- Set of incoming edges
- Collection of properties (key-value pairs)
Edges have:
- Unique identifier
- Vertex at each end
- Relationship label (e.g., "FRIEND_OF", "WORKS_IN")
- Collection of properties
Example:
(Lucy) -[:BORN_IN]-> (Idaho)
(Lucy) -[:FRIEND_OF]-> (Armstrong)
Query Language: Cypher
MATCH (a)-[:BORN_IN]->(s)<-[:BORN_IN]-(b)
WHERE a.name = 'Alice' AND b.name = 'Bob'
RETURN b.name
RDF Triple Stores
Data Model: Subject-Predicate-Object triples
(Lucy, born-in, Idaho)
(Idaho, type, State)
(Idaho, name, "Idaho")
Standards:
- RDF (Resource Description Framework)
- SPARQL query language
- Turtle format
Example SPARQL:
SELECT ?personName WHERE {
?person :bornIn :Idaho .
?person :name ?personName .
}
Graph Databases vs Relational Databases
| Aspect | Graph DB | Relational DB |
|---|---|---|
| Many-to-many relationships | Natural | Possible but awkward |
| Join performance | O(1) per relationship | Can be expensive |
| Schema flexibility | Schema-on-read | Schema-on-write |
| Query optimization | Manual traversal planning | Automatic optimizer |
Cypher Query Language
Creating Nodes:
CREATE (joe:Person {name: 'Joe', age: 25})
CREATE (china:Country {name: 'China'})
CREATE (joe)-[:BORN_IN]->(china)
Querying:
MATCH (a)-[:KNOWS]->(b)-[:KNOWS]->(c)
WHERE a.name = 'Alice'
RETURN c.name
SPARQL Query Language
PREFIX : <http://example.org/>
SELECT ?personName WHERE {
?person :bornIn :Idaho .
?person :name ?personName .
}
Key Takeaways
- Data models profoundly influence how we think about problems
- Relational model excels at joins and many-to-many relationships
- Document model excels at document-like data and locality
- Schema-on-read vs schema-on-write is like dynamic vs static typing
- Declarative query languages enable optimization and parallelism
- Graph models are natural for highly connected data
- Modern databases are converging - hybrid approaches are emerging