Lesson 4

Lesson 4: Memory Management Deep Dive

Jun 26, 2026 9 min read

After this lesson, you understand memory at the hardware level — page table walks, TLB behavior, and allocator design. More importantly, you can explain why vLLM's KV-cache block manager is fundamentally a slab allocator for GPU memory, and why Ghostty's cell grids are cache-friendly by design.

Page Table Walk-Through (x86-64 4-Level)

Every virtual address on x86-64 passes through four levels of page tables:

flowchart LR
    subgraph VA["Virtual Address (48 bits)"]
        direction LR
        sign["Sign Extension<br/>16 bits<br/>bits 63:48"]
        L4["L4 (PML4)<br/>9 bits<br/>bits 47:39"]
        L3["L3 (PDP)<br/>9 bits<br/>bits 38:30"]
        L2["L2 (PD)<br/>9 bits<br/>bits 29:21"]
        L1["L1 (PT)<br/>9 bits<br/>bits 20:12"]
        offset["Offset<br/>12 bits<br/>bits 11:0"]
    end

The hardware walks this tree on every memory access (unless cached in the TLB):

CR3 register → physical address of L4 table
L4[47:39] → physical address of L3 table
L3[38:30] → physical address of L2 table
L2[29:21] → physical address of L1 table
L1[20:12] → physical page number (4KB page)
Offset[11:0] → byte within the page

Each table entry is 8 bytes. Each table has 512 entries (9 bits). An entry either contains the physical address of the next level's table, or the physical page number plus permission bits.

Page Table Entry (PTE) flags

Bit	Name	Meaning
0	Present	Page is in physical memory (if 0, page fault)
1	R/W	0 = read-only, 1 = writable
2	U/S	0 = kernel only, 1 = user accessible
5	Accessed	Set by hardware on access; kernel clears it for page aging
6	Dirty	Set by hardware on write; tells kernel to write back to disk
63	NX	No-execute (if set, instruction fetch faults)

The TLB

A TLB miss on a 4-level walk costs 4 memory accesses (1 per level). With a 200-cycle RAM latency, that's 800 cycles — per address translation. The TLB is a small, fully-associative cache (typically 64-256 entries for L1 D-TLB) that makes this tolerable. A TLB hit costs 0-1 extra cycles.

Huge pages reduce TLB pressure. A 2MB page uses 3 levels instead of 4 (L1 entry points directly to the 2MB page). A 1GB page uses 2 levels. For a terminal rendering glyphs into a texture atlas, the atlas is a contiguous allocation that benefits from huge pages.

malloc Internals

malloc() is not a system call. It's a user-space allocator that manages a pool of memory obtained from the kernel via brk() (for small allocations) and mmap() (for large ones).

Arenas

ptmalloc (glibc's allocator) organizes memory into arenas — one main arena plus per-thread arenas to reduce contention. Each arena maintains its own set of free lists (bins).

Bins

Bin type	Size range	Strategy
Fast bins	16–80 bytes	LIFO, single-linked, never coalesced. Malloc pops from top; free pushes to top. Cache-hot, no fragmentation recovery.
Small bins	512 bytes or less	FIFO, doubly-linked, coalesced with neighbors on free
Large bins	> 512 bytes	Sorted by size, best-fit search
Unsorted bin	Anything freed that isn't fastbin	Catch-all. On next malloc, items are sorted into correct bins
Top chunk	Remainder of the arena	Last resort. Extended via `sbrk()` or `mmap()`

sbrk vs mmap

	sbrk	mmap
Use case	Small allocations (< 128KB default)	Large allocations (≥ 128KB default)
Memory source	Heap (grows upward from end of BSS)	Anywhere in address space
Free behavior	Can't release back to OS easily (only shrink from top)	`munmap()` releases immediately
Fragmentation	External fragmentation risk	Each allocation is independent — no external fragmentation

The 128KB threshold is configurable via mallopt(M_MMAP_THRESHOLD, ...). Allocations above this size bypass the arena system entirely and use mmap directly.

Allocator Design

Bump allocator

The simplest possible allocator. A single pointer advances through a pre-allocated region. Allocation: ptr increments by size. Free: no-op (can't free individual allocations). Reset: ptr = start.

void *bump_alloc(size_t size) {
    void *result = bump_ptr;
    bump_ptr += align_up(size, 16);
    return result;
}

Use case: arena/pool allocation where all allocations die together (frame allocator in a game engine, request-scoped allocations in a web server, per-frame allocations in a GPU renderer).

Slab allocator

Pre-allocates fixed-size objects. Free list links free objects together. Allocation pops from free list. Free pushes back. No fragmentation, no searching, O(1) for both operations.

typedef struct slab {
    struct slab *next;  // embedded free list pointer
    char data[];        // actual object storage
} slab_t;

slab_t *free_list = NULL;

void *slab_alloc() {
    slab_t *obj = free_list;
    free_list = free_list->next;
    return obj->data;
}

void slab_free(void *ptr) {
    slab_t *obj = container_of(ptr, slab_t, data);
    obj->next = free_list;
    free_list = obj;
}

Use case: objects of uniform size created and destroyed frequently. Terminal cell structures. Network packet buffers. vLLM KV-cache blocks — each block is fixed-size (16 tokens × N heads × head_dim = constant bytes).

Bridge to vLLM: vLLM's BlockSpaceManager is a slab allocator for GPU memory. KV-cache blocks are fixed-size allocations (e.g., 16 tokens per block). The block manager maintains a free list of available blocks. Allocation = pop from free list. Free = push back. When a request generates more tokens, it allocates new blocks and chains them. This is exactly the slab pattern, running on GPU VRAM instead of CPU RAM.

Copy-on-Write in Detail

When fork() returns:

Both parent and child page tables point to the same physical pages
All PTEs are marked read-only (R/W = 0)
When either process writes, the hardware triggers a #PF (page fault)
The kernel's page fault handler sees: user-mode write to a present page marked read-only, page is copy-on-write
Kernel allocates a new physical page, copies the contents byte-for-byte
Kernel updates the faulting process's PTE to point to the new page, R/W = 1
Kernel updates the other process's PTE to R/W = 1 (it now has exclusive ownership)
Kernel decrements the reference count on the original physical page
If refcount hits 0, the original page is freed

This is why the child's address space is "identical but separate." The copy only happens on the pages that diverge. In fork() + exec(), the child immediately calls exec(), which replaces the entire address space. The copy-on-write pages are never written to — the child discards them without any copy ever happening. This is why fork() + exec() is efficient.

Huge Pages

Page size	TLB entries needed for 2MB	TLB entries needed for 1GB
4KB	512	262,144
2MB	1	512
1GB	—	1

Huge pages reduce TLB misses by covering more memory per entry. The cost is internal fragmentation (a 2MB page for a 10KB allocation wastes 2MB-10KB) and page table simplicity (fewer levels).

Terminal emulators benefit from huge pages for:

The terminal grid (contiguous virtual memory, frequently accessed)
The font glyph atlas (large, read-heavy, accessed for every glyph render)
CPU/GPU shared buffers (mmap'd, touched by both sides)

NUMA Awareness

On multi-socket systems, each CPU socket has its own memory controller and "local" memory:

flowchart TD
    subgraph S0["Socket 0"]
        cpu0["CPU 0-7<br/>L3 Cache"]
        mc0["Memory Controller"]
    end
    subgraph S1["Socket 1"]
        cpu1["CPU 8-15<br/>L3 Cache"]
        mc1["Memory Controller"]
    end
    ram0["RAM 0<br/>64 GB (local to Socket 0)"]
    ram1["RAM 1<br/>64 GB (local to Socket 1)"]
    mc0 -->|"local access\n~100ns"| ram0
    mc0 -.->|"remote access\n~130-200ns"| ram1
    mc1 -->|"local access"| ram1
    mc1 -.->|"remote access"| ram0

Accessing remote memory costs 1.3-2x more latency than local memory. Linux's default allocator is NUMA-aware: memory allocated on the same node as the thread that called malloc(). For terminal emulators, the I/O thread and render thread might be on different sockets — shared buffers should be allocated on the node where the most frequent accessor (the render thread) lives.

#include <numa.h>
void *buf = numa_alloc_onnode(size, node);

Project: mmap Playground

Write a program that explores the virtual address space:

// 1. mmap a 1GB private anonymous region, touch one byte per 4096
// 2. Observe RSS with /proc/self/smaps or macOS vmmap
// 3. mmap the same file twice, MAP_SHARED — show writes visible through both mappings
// 4. mmap with MAP_FIXED — drop a mapping at a specific address
// 5. mprotect a page read-only, trigger SIGSEGV by writing to it
// 6. madvise(MADV_DONTNEED) — observe RSS drop after hinting pages are no longer needed

Project: Bump and Slab Allocator

Implement both allocators and compare:

Bump allocator: allocate 1M objects of varying sizes (16, 32, 64, 128, 256). Measure throughput.
Slab allocator: allocate 1M objects of fixed size (64 bytes). Measure throughput.
glibc malloc: same benchmarks. Compare throughput and fragmentation.
Visualize fragmentation: after freeing every other allocation, what does the heap look like?

Ghostty Source to Study

Zig's explicit allocator passing — every function that allocates takes an Allocator parameter
Arena allocators for frame-scoped allocations (per-render-frame data)
FixedBufferAllocator for stack-like temporary allocations
GeneralPurposeAllocator as the default, with leak detection in debug builds

Bridge to vLLM KV-Cache

vLLM's BlockSpaceManager manages GPU memory for KV-cache blocks. Each block stores key-value tensors for a contiguous sequence of tokens (typically 16). The manager:

Pre-allocates a pool of GPU memory divided into fixed-size blocks
Maintains a free list (exactly like a slab allocator's free list)
When a request generates tokens, allocates blocks from the free list
When a request finishes (or is preempted), returns blocks to the free list
Chains blocks together via a block table (analogous to page tables — virtual block IDs → physical block addresses)

The fundamental operations — allocate, free, maintain free list — are identical to the slab allocator you just built. The difference is the backing store (GPU VRAM vs CPU RAM) and the access pattern (GPU kernel vs CPU).

Self-Check

Can you:

Walk through a virtual address translation on x86-64 (all 4 levels)
Explain what happens on a TLB miss versus a page fault — they are different things
Name the five bin types in ptmalloc and when each is used
Implement a bump allocator and a slab allocator from memory
Explain why a slab allocator is exactly what vLLM's KV-cache block manager needs
Explain the tradeoff between 4KB, 2MB, and 1GB pages
Describe what fork() does at the page table level and why it's fast
Explain how a terminal emulator's cell grid can benefit from huge pages

Page Table Walk-Through (x86-64 4-Level) #

Page Table Entry (PTE) flags #

The TLB #

malloc Internals #

Arenas #

Bins #

sbrk vs mmap #

Allocator Design #

Bump allocator #

Slab allocator #

Copy-on-Write in Detail #

Huge Pages #

NUMA Awareness #

Project: mmap Playground #

Project: Bump and Slab Allocator #

Ghostty Source to Study #

Bridge to vLLM KV-Cache #

Self-Check #

Page Table Walk-Through (x86-64 4-Level)

Page Table Entry (PTE) flags

The TLB

malloc Internals

Arenas

Bins

sbrk vs mmap

Allocator Design

Bump allocator

Slab allocator

Copy-on-Write in Detail

Huge Pages

NUMA Awareness

Project: mmap Playground

Project: Bump and Slab Allocator

Ghostty Source to Study

Bridge to vLLM KV-Cache

Self-Check