Lesson 5

Lesson 5: Threads & Lock-Free Concurrency

Jun 26, 2026 9 min read

After this lesson, you understand atomics from the hardware memory model up, can implement a lock-free SPSC queue (exactly what Ghostty uses for its thread mailbox), and can explain why vLLM's continuous batching is essentially a work-stealing thread pool applied to GPU inference.

Why Terminal Emulators Need Threads

A terminal emulator has three latency-critical paths that cannot block each other:

I/O thread: reads from the PTY master fd, writes parsed data to the terminal state. Must never block — if it blocks, user input is delayed.
Parser thread: processes the byte stream, updating the terminal grid and state. CPU-intensive (especially during cat bigfile).
Render thread: reads the terminal grid and paints it to the screen via GPU. Must hit vsync deadlines (16.6ms at 60Hz, 8.3ms at 120Hz). If the render thread blocks, the screen stutters.

Ghostty uses three threads communicating through lock-free data structures. The key insight: each thread owns its data. The I/O thread pushes into the parser's queue. The parser pushes into the renderer's queue. No shared mutable state — only messages.

graph LR
    A[I/O Thread<br>read PTY] -->|SPSC queue| B[Parser Thread<br>VT state machine]
    B -->|SPSC queue| C[Render Thread<br>GPU draw]

pthreads

#include <pthread.h>

void *thread_fn(void *arg) {
    int *data = (int *)arg;
    *data = 42;
    return NULL;
}

pthread_t thread;
int arg = 0;
pthread_create(&thread, NULL, thread_fn, &arg);
pthread_join(thread, NULL);  // wait for thread, reclaim resources
// arg is now 42

Three ways a thread can end:

Return from thread_fn (cleanest)
pthread_exit() (explicit exit from anywhere in the call stack)
Cancelled by another thread (pthread_cancel)

If you never join a finished thread, its resources leak (analogous to a zombie process). pthread_detach() marks a thread as "don't wait for me" — resources are reclaimed automatically when the thread exits, but you can't join it.

Thread-Local Storage (TLS)

__thread int my_var;  // each thread gets its own copy
pthread_getspecific(key);  // more flexible: arbitrary pointers

TLS is how per-thread arenas in ptmalloc work — each thread has its own heap arena pointer in TLS, avoiding contention on the global malloc lock.

Mutex and Condition Variable

pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
int ready = 0;

// Waiter
pthread_mutex_lock(&mutex);
while (!ready) {
    pthread_cond_wait(&cond, &mutex);
}
// consume data
pthread_mutex_unlock(&mutex);

// Signaler
pthread_mutex_lock(&mutex);
ready = 1;
pthread_cond_signal(&cond);
pthread_mutex_unlock(&mutex);

Always use a while loop around pthread_cond_wait(), not an if. Spurious wakeups happen (the kernel may wake a waiter even though no signal was sent). The while loop rechecks the condition.

Futex: The Kernel Primitive Underneath

Mutexes and condition variables are user-space data structures backed by the futex (fast userspace mutex) system call. The fast path (uncontended lock) never enters the kernel:

lock():
    atomic_compare_exchange(mutex, 0, 1)  // try to acquire
    if (succeeded) return;                  // fast path: done
    futex(FUTEX_WAIT, &mutex, 1);          // slow path: kernel sleeps us

unlock():
    atomic_store(mutex, 0);                // release
    futex(FUTEX_WAKE, &mutex, 1);          // wake one waiter

The key insight: atomic operations happen in user space. The kernel is only involved when a thread needs to sleep or wake up. This is why uncontended mutex acquisition is ~25ns (just a CAS) but contended acquisition involves a context switch (~1-10µs).

Atomics Deep Dive

C11 defines a memory model with four ordering guarantees:

Ordering	Guarantee	Use case
Relaxed	Atomicity only. No ordering relative to other accesses.	Counters, statistics, monotonically increasing IDs
Acquire	All subsequent reads/writes see state at least as recent as this load.	Loading a "ready" flag before reading published data
Release	All prior reads/writes are visible before this store.	Publishing data before setting "ready" flag
SeqCst	Total global order of all seq_cst operations. Every thread agrees on the order.	The default. Most expensive. Rarely necessary.

The producer-consumer pair

// Shared
int data[N];          // written by producer, read by consumer
atomic_int ready;     // 0 = no data, 1 = data available

// Producer
data[0] = 42;
data[1] = 99;
atomic_store_explicit(&ready, 1, memory_order_release);  // publish

// Consumer
while (atomic_load_explicit(&ready, memory_order_acquire) == 0) {
    // spin
}
// Now safe to read data[0], data[1]
// Guaranteed to see 42 and 99

The release/acquire pair creates a happens-before relationship. The producer's writes before the release are visible to the consumer after the acquire. No mutex. No kernel involvement. Just ordering constraints on the CPU's store buffer and cache coherence protocol.

On x86, acquire/release are free. x86's Total Store Order (TSO) model already guarantees that stores appear in program order and loads are not reordered with older stores. memory_order_release generates no barrier instructions. memory_order_acquire generates no barrier instructions. Only memory_order_seq_cst needs mfence. On ARM, acquire/release need explicit barriers (dmb instruction).

What goes wrong without ordering

// Producer                    Consumer
data = 42;                     while (!ready) {}  // acquire
ready = 1;  // relaxed        print(data);        // ???
//  BUG: consumer might see   // Might print 0!
//  ready=1 before data=42

Without release/acquire, the CPU's store buffer can reorder the producer's stores. The consumer sees ready=1 but the store to data is still in the producer's store buffer, invisible to the consumer.

Lock-Free SPSC Queue

A Single-Producer Single-Consumer queue using a fixed-size ring buffer, with head and tail indices padded to separate cache lines to prevent false sharing.

#define CACHE_LINE 64
#define CAPACITY 1024

typedef struct {
    int data[CAPACITY];
    atomic_int head __attribute__((aligned(CACHE_LINE)));
    atomic_int tail __attribute__((aligned(CACHE_LINE)));
} spsc_queue_t;

// Producer
bool spsc_push(spsc_queue_t *q, int val) {
    int head = atomic_load_explicit(&q->head, memory_order_relaxed);
    int next = (head + 1) % CAPACITY;
    int tail = atomic_load_explicit(&q->tail, memory_order_acquire);
    if (next == tail) return false;  // full
    q->data[head] = val;
    atomic_store_explicit(&q->head, next, memory_order_release);
    return true;
}

// Consumer
bool spsc_pop(spsc_queue_t *q, int *val) {
    int tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
    int head = atomic_load_explicit(&q->head, memory_order_acquire);
    if (head == tail) return false;  // empty
    *val = q->data[tail];
    int next = (tail + 1) % CAPACITY;
    atomic_store_explicit(&q->tail, next, memory_order_release);
    return true;
}

Why only the producer writes head and only the consumer writes tail: this eliminates the need for CAS loops (Compare-And-Swap). CAS is more expensive than a simple store on x86. The producer owns the write to head; the consumer owns the write to tail. The other party only reads them. This is the key insight that makes SPSC queues faster than MPMC queues.

Why it's lock-free (and what "lock-free" means)

No thread can block another. If the producer is preempted after writing data but before updating head, the consumer sees an empty queue and returns false — it doesn't spin forever. Compare to a mutex: if a thread holding the mutex is preempted, all waiters block.

Wait-free is stronger: every thread makes progress in a bounded number of steps, regardless of what other threads do. Lock-free means at least one thread makes progress in a bounded number of steps. SPSC push/pop are wait-free for the single producer/consumer.

Thread Pools and Work Stealing

A thread pool maintains N worker threads and a shared work queue. Workers take tasks from the queue. When a worker's local queue is empty, it steals tasks from another worker's queue.

// Each worker has:
// - A local deque (work-stealing queue)
// - A pointer to the global pool for stealing

// Worker adds work (to its own end):
void push(worker_t *w, task_t *task) {
    w->deque[w->tail++] = task;  // only this worker writes to its own deque
}

// Worker pops work (from its own end, LIFO):
task_t *pop(worker_t *w) {
    return w->deque[--w->tail];
}

// Steal from another worker (from their opposite end, FIFO):
task_t *steal(worker_t *victim) {
    int head = atomic_load(&victim->head, acquire);
    int tail = atomic_load(&victim->tail, acquire);
    if (head >= tail) return NULL;
    task_t *task = victim->deque[head];
    // CAS to claim the task (another thief might race us)
    if (atomic_compare_exchange_strong(&victim->head, &head, head + 1, release, relaxed)) {
        return task;
    }
    return NULL;  // lost the race
}

The asymmetry (owner pushes/pops from tail, thieves steal from head) minimizes contention: the owner only touches the tail, thieves only touch the head.

Bridge to vLLM continuous batching: vLLM schedules GPU inference across requests. Each request has a sequence of tokens to process. The scheduler picks which requests to batch together for the next forward pass. This is analogous to a thread pool: requests are tasks, the GPU is the worker, and "continuous batching" means the GPU doesn't wait for all requests in a batch to finish before starting new ones — it steals ready-to-process sequences dynamically, exactly like work stealing.

Build this to feel false sharing in your fingers:

// Version 1: same cache line (SLOW)
struct { atomic_int a; atomic_int b; } counters;

// Version 2: separate cache lines (FAST)
struct { atomic_int a; char _pad[60]; } counter_a;
struct { atomic_int b; char _pad[60]; } counter_b;

// Two threads, each increments its own counter 100M times.
// Version 1: ~10x slower due to cache line ping-pong.
// Version 2: ~10x faster because each thread owns its cache line.

Graph the throughput versus thread count. Explain the MESI state transitions: in version 1, every write by Thread A invalidates Thread B's cache line (Modified → Invalid). Thread B's next write must fetch the line from Thread A's cache (Invalid → Shared → Modified). Two state transitions per operation. In version 2, each thread's cache line stays in Modified state — no transitions, full speed.

Ghostty Source to Study

File	What to study
`src/termio/mailbox.zig`	Lock-free SPSC queue with cache-line padding. The core IPC primitive.
`src/Surface.zig`	Thread orchestration: I/O thread, parser, renderer. How messages flow between threads.
`src/renderer/Thread.zig`	Render thread main loop: dequeue render updates, draw, present.

Self-Check

Can you:

Explain why a terminal emulator needs threads and what each thread does
Write pthread_create + pthread_join from memory
Explain the difference between lock-free and wait-free
Write a lock-free SPSC queue with release/acquire barriers and explain every ordering constraint
Explain why the SPSC queue doesn't need CAS (single-producer, single-consumer means no contention on head/tail writes)
Draw the MESI transitions for false sharing and explain why padding fixes it
Explain how vLLM's continuous batching is conceptually similar to work stealing
Write a thread pool with work stealing

Why Terminal Emulators Need Threads #

pthreads #

Thread-Local Storage (TLS) #

Mutex and Condition Variable #

Futex: The Kernel Primitive Underneath #

Atomics Deep Dive #

The producer-consumer pair #

What goes wrong without ordering #

Lock-Free SPSC Queue #

Why it's lock-free (and what "lock-free" means) #

Thread Pools and Work Stealing #

False Sharing Benchmark (Project) #

Ghostty Source to Study #

Self-Check #

Why Terminal Emulators Need Threads

pthreads

Thread-Local Storage (TLS)

Mutex and Condition Variable

Futex: The Kernel Primitive Underneath

Atomics Deep Dive

The producer-consumer pair

What goes wrong without ordering

Lock-Free SPSC Queue

Why it's lock-free (and what "lock-free" means)

Thread Pools and Work Stealing

False Sharing Benchmark (Project)

Ghostty Source to Study

Self-Check