Lesson 4: Memory Management Deep Dive
After this lesson, you understand memory at the hardware level — page table walks, TLB behavior, and allocator design. More importantly, you can explain why vLLM's KV-cache block manager is fundamentally a slab allocator for GPU memory, and why Ghostty's cell grids are cache-friendly by design.
Page Table Walk-Through (x86-64 4-Level)
Every virtual address on x86-64 passes through four levels of page tables:
flowchart LR
subgraph VA["Virtual Address (48 bits)"]
direction LR
sign["Sign Extension<br/>16 bits<br/>bits 63:48"]
L4["L4 (PML4)<br/>9 bits<br/>bits 47:39"]
L3["L3 (PDP)<br/>9 bits<br/>bits 38:30"]
L2["L2 (PD)<br/>9 bits<br/>bits 29:21"]
L1["L1 (PT)<br/>9 bits<br/>bits 20:12"]
offset["Offset<br/>12 bits<br/>bits 11:0"]
end
The hardware walks this tree on every memory access (unless cached in the TLB):
- CR3 register → physical address of L4 table
- L4[47:39] → physical address of L3 table
- L3[38:30] → physical address of L2 table
- L2[29:21] → physical address of L1 table
- L1[20:12] → physical page number (4KB page)
- Offset[11:0] → byte within the page
Each table entry is 8 bytes. Each table has 512 entries (9 bits). An entry either contains the physical address of the next level's table, or the physical page number plus permission bits.
Page Table Entry (PTE) flags
| Bit | Name | Meaning |
|---|---|---|
| 0 | Present | Page is in physical memory (if 0, page fault) |
| 1 | R/W | 0 = read-only, 1 = writable |
| 2 | U/S | 0 = kernel only, 1 = user accessible |
| 5 | Accessed | Set by hardware on access; kernel clears it for page aging |
| 6 | Dirty | Set by hardware on write; tells kernel to write back to disk |
| 63 | NX | No-execute (if set, instruction fetch faults) |
The TLB
A TLB miss on a 4-level walk costs 4 memory accesses (1 per level). With a 200-cycle RAM latency, that's 800 cycles — per address translation. The TLB is a small, fully-associative cache (typically 64-256 entries for L1 D-TLB) that makes this tolerable. A TLB hit costs 0-1 extra cycles.
Huge pages reduce TLB pressure. A 2MB page uses 3 levels instead of 4 (L1 entry points directly to the 2MB page). A 1GB page uses 2 levels. For a terminal rendering glyphs into a texture atlas, the atlas is a contiguous allocation that benefits from huge pages.
malloc Internals
malloc() is not a system call. It's a user-space allocator that manages a pool of memory obtained from the kernel via brk() (for small allocations) and mmap() (for large ones).
Arenas
ptmalloc (glibc's allocator) organizes memory into arenas — one main arena plus per-thread arenas to reduce contention. Each arena maintains its own set of free lists (bins).
Bins
| Bin type | Size range | Strategy |
|---|---|---|
| Fast bins | 16–80 bytes | LIFO, single-linked, never coalesced. Malloc pops from top; free pushes to top. Cache-hot, no fragmentation recovery. |
| Small bins | 512 bytes or less | FIFO, doubly-linked, coalesced with neighbors on free |
| Large bins | > 512 bytes | Sorted by size, best-fit search |
| Unsorted bin | Anything freed that isn't fastbin | Catch-all. On next malloc, items are sorted into correct bins |
| Top chunk | Remainder of the arena | Last resort. Extended via sbrk() or mmap() |
sbrk vs mmap
| sbrk | mmap | |
|---|---|---|
| Use case | Small allocations (< 128KB default) | Large allocations (≥ 128KB default) |
| Memory source | Heap (grows upward from end of BSS) | Anywhere in address space |
| Free behavior | Can't release back to OS easily (only shrink from top) | munmap() releases immediately |
| Fragmentation | External fragmentation risk | Each allocation is independent — no external fragmentation |
The 128KB threshold is configurable via mallopt(M_MMAP_THRESHOLD, ...). Allocations above this size bypass the arena system entirely and use mmap directly.
Allocator Design
Bump allocator
The simplest possible allocator. A single pointer advances through a pre-allocated region. Allocation: ptr increments by size. Free: no-op (can't free individual allocations). Reset: ptr = start.
void *bump_alloc(size_t size) {
void *result = bump_ptr;
bump_ptr += align_up(size, 16);
return result;
}
Use case: arena/pool allocation where all allocations die together (frame allocator in a game engine, request-scoped allocations in a web server, per-frame allocations in a GPU renderer).
Slab allocator
Pre-allocates fixed-size objects. Free list links free objects together. Allocation pops from free list. Free pushes back. No fragmentation, no searching, O(1) for both operations.
typedef struct slab {
struct slab *next; // embedded free list pointer
char data[]; // actual object storage
} slab_t;
slab_t *free_list = NULL;
void *slab_alloc() {
slab_t *obj = free_list;
free_list = free_list->next;
return obj->data;
}
void slab_free(void *ptr) {
slab_t *obj = container_of(ptr, slab_t, data);
obj->next = free_list;
free_list = obj;
}
Use case: objects of uniform size created and destroyed frequently. Terminal cell structures. Network packet buffers. vLLM KV-cache blocks — each block is fixed-size (16 tokens × N heads × head_dim = constant bytes).
Bridge to vLLM: vLLM's BlockSpaceManager is a slab allocator for GPU memory. KV-cache blocks are fixed-size allocations (e.g., 16 tokens per block). The block manager maintains a free list of available blocks. Allocation = pop from free list. Free = push back. When a request generates more tokens, it allocates new blocks and chains them. This is exactly the slab pattern, running on GPU VRAM instead of CPU RAM.
Copy-on-Write in Detail
When fork() returns:
- Both parent and child page tables point to the same physical pages
- All PTEs are marked read-only (R/W = 0)
- When either process writes, the hardware triggers a #PF (page fault)
- The kernel's page fault handler sees: user-mode write to a present page marked read-only, page is copy-on-write
- Kernel allocates a new physical page, copies the contents byte-for-byte
- Kernel updates the faulting process's PTE to point to the new page, R/W = 1
- Kernel updates the other process's PTE to R/W = 1 (it now has exclusive ownership)
- Kernel decrements the reference count on the original physical page
- If refcount hits 0, the original page is freed
This is why the child's address space is "identical but separate." The copy only happens on the pages that diverge. In fork() + exec(), the child immediately calls exec(), which replaces the entire address space. The copy-on-write pages are never written to — the child discards them without any copy ever happening. This is why fork() + exec() is efficient.
Huge Pages
| Page size | TLB entries needed for 2MB | TLB entries needed for 1GB |
|---|---|---|
| 4KB | 512 | 262,144 |
| 2MB | 1 | 512 |
| 1GB | — | 1 |
Huge pages reduce TLB misses by covering more memory per entry. The cost is internal fragmentation (a 2MB page for a 10KB allocation wastes 2MB-10KB) and page table simplicity (fewer levels).
Terminal emulators benefit from huge pages for:
- The terminal grid (contiguous virtual memory, frequently accessed)
- The font glyph atlas (large, read-heavy, accessed for every glyph render)
- CPU/GPU shared buffers (mmap'd, touched by both sides)
NUMA Awareness
On multi-socket systems, each CPU socket has its own memory controller and "local" memory:
flowchart TD
subgraph S0["Socket 0"]
cpu0["CPU 0-7<br/>L3 Cache"]
mc0["Memory Controller"]
end
subgraph S1["Socket 1"]
cpu1["CPU 8-15<br/>L3 Cache"]
mc1["Memory Controller"]
end
ram0["RAM 0<br/>64 GB (local to Socket 0)"]
ram1["RAM 1<br/>64 GB (local to Socket 1)"]
mc0 -->|"local access\n~100ns"| ram0
mc0 -.->|"remote access\n~130-200ns"| ram1
mc1 -->|"local access"| ram1
mc1 -.->|"remote access"| ram0
Accessing remote memory costs 1.3-2x more latency than local memory. Linux's default allocator is NUMA-aware: memory allocated on the same node as the thread that called malloc(). For terminal emulators, the I/O thread and render thread might be on different sockets — shared buffers should be allocated on the node where the most frequent accessor (the render thread) lives.
#include <numa.h>
void *buf = numa_alloc_onnode(size, node);
Project: mmap Playground
Write a program that explores the virtual address space:
// 1. mmap a 1GB private anonymous region, touch one byte per 4096
// 2. Observe RSS with /proc/self/smaps or macOS vmmap
// 3. mmap the same file twice, MAP_SHARED — show writes visible through both mappings
// 4. mmap with MAP_FIXED — drop a mapping at a specific address
// 5. mprotect a page read-only, trigger SIGSEGV by writing to it
// 6. madvise(MADV_DONTNEED) — observe RSS drop after hinting pages are no longer needed
Project: Bump and Slab Allocator
Implement both allocators and compare:
- Bump allocator: allocate 1M objects of varying sizes (16, 32, 64, 128, 256). Measure throughput.
- Slab allocator: allocate 1M objects of fixed size (64 bytes). Measure throughput.
- glibc malloc: same benchmarks. Compare throughput and fragmentation.
- Visualize fragmentation: after freeing every other allocation, what does the heap look like?
Ghostty Source to Study
- Zig's explicit allocator passing — every function that allocates takes an
Allocatorparameter - Arena allocators for frame-scoped allocations (per-render-frame data)
- FixedBufferAllocator for stack-like temporary allocations
- GeneralPurposeAllocator as the default, with leak detection in debug builds
Bridge to vLLM KV-Cache
vLLM's BlockSpaceManager manages GPU memory for KV-cache blocks. Each block stores key-value tensors for a contiguous sequence of tokens (typically 16). The manager:
- Pre-allocates a pool of GPU memory divided into fixed-size blocks
- Maintains a free list (exactly like a slab allocator's free list)
- When a request generates tokens, allocates blocks from the free list
- When a request finishes (or is preempted), returns blocks to the free list
- Chains blocks together via a block table (analogous to page tables — virtual block IDs → physical block addresses)
The fundamental operations — allocate, free, maintain free list — are identical to the slab allocator you just built. The difference is the backing store (GPU VRAM vs CPU RAM) and the access pattern (GPU kernel vs CPU).
Self-Check
Can you:
- Walk through a virtual address translation on x86-64 (all 4 levels)
- Explain what happens on a TLB miss versus a page fault — they are different things
- Name the five bin types in ptmalloc and when each is used
- Implement a bump allocator and a slab allocator from memory
- Explain why a slab allocator is exactly what vLLM's KV-cache block manager needs
- Explain the tradeoff between 4KB, 2MB, and 1GB pages
- Describe what
fork()does at the page table level and why it's fast - Explain how a terminal emulator's cell grid can benefit from huge pages