Lesson 5: Threads & Lock-Free Concurrency
After this lesson, you understand atomics from the hardware memory model up, can implement a lock-free SPSC queue (exactly what Ghostty uses for its thread mailbox), and can explain why vLLM's continuous batching is essentially a work-stealing thread pool applied to GPU inference.
Why Terminal Emulators Need Threads
A terminal emulator has three latency-critical paths that cannot block each other:
- I/O thread: reads from the PTY master fd, writes parsed data to the terminal state. Must never block — if it blocks, user input is delayed.
- Parser thread: processes the byte stream, updating the terminal grid and state. CPU-intensive (especially during
cat bigfile). - Render thread: reads the terminal grid and paints it to the screen via GPU. Must hit vsync deadlines (16.6ms at 60Hz, 8.3ms at 120Hz). If the render thread blocks, the screen stutters.
Ghostty uses three threads communicating through lock-free data structures. The key insight: each thread owns its data. The I/O thread pushes into the parser's queue. The parser pushes into the renderer's queue. No shared mutable state — only messages.
graph LR
A[I/O Thread<br>read PTY] -->|SPSC queue| B[Parser Thread<br>VT state machine]
B -->|SPSC queue| C[Render Thread<br>GPU draw]
pthreads
#include <pthread.h>
void *thread_fn(void *arg) {
int *data = (int *)arg;
*data = 42;
return NULL;
}
pthread_t thread;
int arg = 0;
pthread_create(&thread, NULL, thread_fn, &arg);
pthread_join(thread, NULL); // wait for thread, reclaim resources
// arg is now 42
Three ways a thread can end:
- Return from
thread_fn(cleanest) pthread_exit()(explicit exit from anywhere in the call stack)- Cancelled by another thread (
pthread_cancel)
If you never join a finished thread, its resources leak (analogous to a zombie process). pthread_detach() marks a thread as "don't wait for me" — resources are reclaimed automatically when the thread exits, but you can't join it.
Thread-Local Storage (TLS)
__thread int my_var; // each thread gets its own copy
pthread_getspecific(key); // more flexible: arbitrary pointers
TLS is how per-thread arenas in ptmalloc work — each thread has its own heap arena pointer in TLS, avoiding contention on the global malloc lock.
Mutex and Condition Variable
pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
int ready = 0;
// Waiter
pthread_mutex_lock(&mutex);
while (!ready) {
pthread_cond_wait(&cond, &mutex);
}
// consume data
pthread_mutex_unlock(&mutex);
// Signaler
pthread_mutex_lock(&mutex);
ready = 1;
pthread_cond_signal(&cond);
pthread_mutex_unlock(&mutex);
Always use a while loop around pthread_cond_wait(), not an if. Spurious wakeups happen (the kernel may wake a waiter even though no signal was sent). The while loop rechecks the condition.
Futex: The Kernel Primitive Underneath
Mutexes and condition variables are user-space data structures backed by the futex (fast userspace mutex) system call. The fast path (uncontended lock) never enters the kernel:
lock():
atomic_compare_exchange(mutex, 0, 1) // try to acquire
if (succeeded) return; // fast path: done
futex(FUTEX_WAIT, &mutex, 1); // slow path: kernel sleeps us
unlock():
atomic_store(mutex, 0); // release
futex(FUTEX_WAKE, &mutex, 1); // wake one waiter
The key insight: atomic operations happen in user space. The kernel is only involved when a thread needs to sleep or wake up. This is why uncontended mutex acquisition is ~25ns (just a CAS) but contended acquisition involves a context switch (~1-10µs).
Atomics Deep Dive
C11 defines a memory model with four ordering guarantees:
| Ordering | Guarantee | Use case |
|---|---|---|
| Relaxed | Atomicity only. No ordering relative to other accesses. | Counters, statistics, monotonically increasing IDs |
| Acquire | All subsequent reads/writes see state at least as recent as this load. | Loading a "ready" flag before reading published data |
| Release | All prior reads/writes are visible before this store. | Publishing data before setting "ready" flag |
| SeqCst | Total global order of all seq_cst operations. Every thread agrees on the order. | The default. Most expensive. Rarely necessary. |
The producer-consumer pair
// Shared
int data[N]; // written by producer, read by consumer
atomic_int ready; // 0 = no data, 1 = data available
// Producer
data[0] = 42;
data[1] = 99;
atomic_store_explicit(&ready, 1, memory_order_release); // publish
// Consumer
while (atomic_load_explicit(&ready, memory_order_acquire) == 0) {
// spin
}
// Now safe to read data[0], data[1]
// Guaranteed to see 42 and 99
The release/acquire pair creates a happens-before relationship. The producer's writes before the release are visible to the consumer after the acquire. No mutex. No kernel involvement. Just ordering constraints on the CPU's store buffer and cache coherence protocol.
On x86, acquire/release are free. x86's Total Store Order (TSO) model already guarantees that stores appear in program order and loads are not reordered with older stores. memory_order_release generates no barrier instructions. memory_order_acquire generates no barrier instructions. Only memory_order_seq_cst needs mfence. On ARM, acquire/release need explicit barriers (dmb instruction).
What goes wrong without ordering
// Producer Consumer
data = 42; while (!ready) {} // acquire
ready = 1; // relaxed print(data); // ???
// BUG: consumer might see // Might print 0!
// ready=1 before data=42
Without release/acquire, the CPU's store buffer can reorder the producer's stores. The consumer sees ready=1 but the store to data is still in the producer's store buffer, invisible to the consumer.
Lock-Free SPSC Queue
A Single-Producer Single-Consumer queue using a fixed-size ring buffer, with head and tail indices padded to separate cache lines to prevent false sharing.
#define CACHE_LINE 64
#define CAPACITY 1024
typedef struct {
int data[CAPACITY];
atomic_int head __attribute__((aligned(CACHE_LINE)));
atomic_int tail __attribute__((aligned(CACHE_LINE)));
} spsc_queue_t;
// Producer
bool spsc_push(spsc_queue_t *q, int val) {
int head = atomic_load_explicit(&q->head, memory_order_relaxed);
int next = (head + 1) % CAPACITY;
int tail = atomic_load_explicit(&q->tail, memory_order_acquire);
if (next == tail) return false; // full
q->data[head] = val;
atomic_store_explicit(&q->head, next, memory_order_release);
return true;
}
// Consumer
bool spsc_pop(spsc_queue_t *q, int *val) {
int tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
int head = atomic_load_explicit(&q->head, memory_order_acquire);
if (head == tail) return false; // empty
*val = q->data[tail];
int next = (tail + 1) % CAPACITY;
atomic_store_explicit(&q->tail, next, memory_order_release);
return true;
}
Why only the producer writes head and only the consumer writes tail: this eliminates the need for CAS loops (Compare-And-Swap). CAS is more expensive than a simple store on x86. The producer owns the write to head; the consumer owns the write to tail. The other party only reads them. This is the key insight that makes SPSC queues faster than MPMC queues.
Why it's lock-free (and what "lock-free" means)
No thread can block another. If the producer is preempted after writing data but before updating head, the consumer sees an empty queue and returns false — it doesn't spin forever. Compare to a mutex: if a thread holding the mutex is preempted, all waiters block.
Wait-free is stronger: every thread makes progress in a bounded number of steps, regardless of what other threads do. Lock-free means at least one thread makes progress in a bounded number of steps. SPSC push/pop are wait-free for the single producer/consumer.
Thread Pools and Work Stealing
A thread pool maintains N worker threads and a shared work queue. Workers take tasks from the queue. When a worker's local queue is empty, it steals tasks from another worker's queue.
// Each worker has:
// - A local deque (work-stealing queue)
// - A pointer to the global pool for stealing
// Worker adds work (to its own end):
void push(worker_t *w, task_t *task) {
w->deque[w->tail++] = task; // only this worker writes to its own deque
}
// Worker pops work (from its own end, LIFO):
task_t *pop(worker_t *w) {
return w->deque[--w->tail];
}
// Steal from another worker (from their opposite end, FIFO):
task_t *steal(worker_t *victim) {
int head = atomic_load(&victim->head, acquire);
int tail = atomic_load(&victim->tail, acquire);
if (head >= tail) return NULL;
task_t *task = victim->deque[head];
// CAS to claim the task (another thief might race us)
if (atomic_compare_exchange_strong(&victim->head, &head, head + 1, release, relaxed)) {
return task;
}
return NULL; // lost the race
}
The asymmetry (owner pushes/pops from tail, thieves steal from head) minimizes contention: the owner only touches the tail, thieves only touch the head.
Bridge to vLLM continuous batching: vLLM schedules GPU inference across requests. Each request has a sequence of tokens to process. The scheduler picks which requests to batch together for the next forward pass. This is analogous to a thread pool: requests are tasks, the GPU is the worker, and "continuous batching" means the GPU doesn't wait for all requests in a batch to finish before starting new ones — it steals ready-to-process sequences dynamically, exactly like work stealing.
False Sharing Benchmark (Project)
Build this to feel false sharing in your fingers:
// Version 1: same cache line (SLOW)
struct { atomic_int a; atomic_int b; } counters;
// Version 2: separate cache lines (FAST)
struct { atomic_int a; char _pad[60]; } counter_a;
struct { atomic_int b; char _pad[60]; } counter_b;
// Two threads, each increments its own counter 100M times.
// Version 1: ~10x slower due to cache line ping-pong.
// Version 2: ~10x faster because each thread owns its cache line.
Graph the throughput versus thread count. Explain the MESI state transitions: in version 1, every write by Thread A invalidates Thread B's cache line (Modified → Invalid). Thread B's next write must fetch the line from Thread A's cache (Invalid → Shared → Modified). Two state transitions per operation. In version 2, each thread's cache line stays in Modified state — no transitions, full speed.
Ghostty Source to Study
| File | What to study |
|---|---|
src/termio/mailbox.zig |
Lock-free SPSC queue with cache-line padding. The core IPC primitive. |
src/Surface.zig |
Thread orchestration: I/O thread, parser, renderer. How messages flow between threads. |
src/renderer/Thread.zig |
Render thread main loop: dequeue render updates, draw, present. |
Self-Check
Can you:
- Explain why a terminal emulator needs threads and what each thread does
- Write
pthread_create+pthread_joinfrom memory - Explain the difference between lock-free and wait-free
- Write a lock-free SPSC queue with release/acquire barriers and explain every ordering constraint
- Explain why the SPSC queue doesn't need CAS (single-producer, single-consumer means no contention on head/tail writes)
- Draw the MESI transitions for false sharing and explain why padding fixes it
- Explain how vLLM's continuous batching is conceptually similar to work stealing
- Write a thread pool with work stealing