Chapter 2: Data Models and Query Languages

6 min read

Core Concepts

Data Models Matter

Data models have profound effect on how software is written AND how we think about problems
Most applications are built by layering data models on top of each other

Layered Data Representation:

Application developer models real world → objects/data structures
Data structures → general-purpose data model (JSON, relational tables, graphs)
Database engineers → bytes in memory/disk/network
Hardware engineers → electrical currents, magnetic fields

Relational Model vs Document Model

Relational Model (SQL)

Proposed by Edgar Codd in 1970
Data organized into relations (tables) of tuples (rows)
Dominated for 25-30 years
Roots in business data processing (transaction processing, batch processing)

Key Advantage: Hid implementation details behind clean interface

Document Model (NoSQL)

Driving Forces Behind NoSQL:

Need for greater scalability (large datasets, high write throughput)
Preference for free/open source software
Specialized query operations not well supported by relational model
Frustration with restrictive schemas

Example - LinkedIn Profile:

Relational approach:

-- Multiple tables with foreign keys
users (user_id, first_name, last_name, ...)
positions (user_id, job_title, organization, ...)
education (user_id, school_name, start, end, ...)
contact_info (user_id, blog, twitter, ...)

Document approach (JSON):

{
  "user_id": 251,
  "first_name": "Bill",
  "last_name": "Gates",
  "positions": [...],
  "education": [...],
  "contact_info": {...}
}

Object-Relational Mismatch

Impedance mismatch: Disconnect between OOP objects and relational tables
ORM frameworks (ActiveRecord, Hibernate) help but don't eliminate the mismatch
JSON document model can feel more natural for document-like data

Many-to-One and Many-to-Many Relationships

Normalization vs Denormalization:

Aspect	Normalized (IDs)	Denormalized (Strings)
Storage	Less duplication	More duplication
Updates	Update in one place	Update everywhere
Consistency	Easier to maintain	Risk of inconsistency
Query performance	Requires joins	Single query

Document Database Limitations:

Weak join support
Many-to-many relationships require multiple queries
Data becomes more interconnected over time

Historical Context: Document DBs Repeating History

Hierarchical Model (IMS, 1968):

Similar to JSON/document model
Tree of records nested within records
Good for one-to-many, poor for many-to-many

Network Model (CODASYL):

Generalization of hierarchical model
Records could have multiple parents
Required manual "access path" navigation
Code became complex and inflexible

Relational Model Advantage:

Laid out all data openly
Query optimizer automatically decides access paths
Easy to add new features without changing queries

Schema Flexibility

Schema-on-Write (Relational):

Schema is explicit and enforced
Requires migrations for changes
Similar to static typing

Schema-on-Read (Document):

Structure is implicit, interpreted when read
More flexible for changing data formats
Similar to dynamic typing

When Schema-on-Read is Better:

Heterogeneous data (many different object types)
Data structure determined by external systems
Rapidly evolving requirements

Data Locality

Document Model Advantage:

Entire document stored as single continuous string
Better locality for queries that access whole document
Single query retrieves all relevant data

Document Model Limitations:

Must load entire document even for small portions
Updates may require rewriting entire document
Keep documents small for best performance

Convergence

Modern databases are becoming more similar:

Relational databases adding JSON support (PostgreSQL, MySQL, IBM DB2)
Document databases adding relational-like joins (RethinkDB)
Hybrid approach: Best of both worlds

Question	Consider Relational	Consider Document	Consider Graph
Many-to-many relationships?	✓ Good join support	✗ Weak join support	✓ Best for complex relationships
Schema evolving frequently?	✗ Needs migrations	✓ Schema-on-read	Depends on DB
Need whole document at once?	✗ Multiple queries/joins	✓ Single query, data locality	✗ Traversal overhead
Write-heavy with complex validation?	✓ ACID, constraints	✗ Limited constraints	✗ Variable by DB
Deeply nested one-to-many data?	✗ Many tables needed	✓ Natural JSON nesting	✓ Natural graph nesting
Ad-hoc analytical queries?	✓ SQL is standard	✗ Limited query languages	✗ Specialized query languages

Query Languages

Imperative vs Declarative

Imperative (e.g., JavaScript):

function getSharks(animals) {
    var sharks = [];
    for (var i = 0; i < animals.length; i++) {
        if (animals[i].family === "Sharks") {
            sharks.push(animals[i]);
        }
    }
    return sharks;
}

Declarative (SQL/Relational Algebra):

SELECT * FROM animals WHERE family = 'Sharks';

Advantages of Declarative:

More concise and easier to work with
Hides implementation details → database can optimize
Better for parallel execution
Database can introduce improvements without changing queries

Declarative on the Web

CSS is declarative:

li.selected > p { background-color: blue; }

XSL/XPath is declarative:

<xsl:template match="li[@class='selected']/p">
    <fo:block background-color="blue">

JavaScript DOM manipulation is imperative (verbose, error-prone)

Graph-Like Data Models

When to Use Graphs

Data has many-to-many relationships
Relationships between data items are complex
Examples: Social networks, knowledge graphs, road networks

Property Graph Model (Neo4j)

Vertices have:

Unique identifier
Set of outgoing edges
Set of incoming edges
Collection of properties (key-value pairs)

Edges have:

Unique identifier
Vertex at each end
Relationship label (e.g., "FRIEND_OF", "WORKS_IN")
Collection of properties

Example:

(Lucy) -[:BORN_IN]-> (Idaho)
(Lucy) -[:FRIEND_OF]-> (Armstrong)

Query Language: Cypher

MATCH (a)-[:BORN_IN]->(s)<-[:BORN_IN]-(b)
WHERE a.name = 'Alice' AND b.name = 'Bob'
RETURN b.name

RDF Triple Stores

Data Model: Subject-Predicate-Object triples

(Lucy, born-in, Idaho)
(Idaho, type, State)
(Idaho, name, "Idaho")

Standards:

RDF (Resource Description Framework)
SPARQL query language
Turtle format

Example SPARQL:

SELECT ?personName WHERE {
  ?person :bornIn :Idaho .
  ?person :name ?personName .
}

Graph Databases vs Relational Databases

Aspect	Graph DB	Relational DB
Many-to-many relationships	Natural	Possible but awkward
Join performance	O(1) per relationship	Can be expensive
Schema flexibility	Schema-on-read	Schema-on-write
Query optimization	Manual traversal planning	Automatic optimizer

Cypher Query Language

Creating Nodes:

CREATE (joe:Person {name: 'Joe', age: 25})
CREATE (china:Country {name: 'China'})
CREATE (joe)-[:BORN_IN]->(china)

Querying:

MATCH (a)-[:KNOWS]->(b)-[:KNOWS]->(c)
WHERE a.name = 'Alice'
RETURN c.name

SPARQL Query Language

PREFIX : <http://example.org/>
SELECT ?personName WHERE {
  ?person :bornIn :Idaho .
  ?person :name ?personName .
}

Key Takeaways

Data models profoundly influence how we think about problems
Relational model excels at joins and many-to-many relationships
Document model excels at document-like data and locality
Schema-on-read vs schema-on-write is like dynamic vs static typing
Declarative query languages enable optimization and parallelism
Graph models are natural for highly connected data
Modern databases are converging - hybrid approaches are emerging

Core Concepts #

Data Models Matter #

Relational Model vs Document Model #

Relational Model (SQL) #

Document Model (NoSQL) #

Object-Relational Mismatch #

Many-to-One and Many-to-Many Relationships #

Historical Context: Document DBs Repeating History #

Schema Flexibility #

Data Locality #

Convergence #

Query Languages #

Imperative vs Declarative #

Declarative on the Web #

Graph-Like Data Models #

When to Use Graphs #

Property Graph Model (Neo4j) #

RDF Triple Stores #

Graph Databases vs Relational Databases #

Cypher Query Language #

SPARQL Query Language #

Key Takeaways #