Lesson 7: GPU & Rendering Pipeline
The terminal renderer is the hardest performance problem in a terminal emulator. A full-screen cat of a large file can generate millions of cell updates per second. Each cell holds a glyph that must be shaped, rasterized, and drawn on the GPU, all within a 16.6 ms frame budget at 60 Hz. After this lesson, you will understand the GPU pipeline from draw call to pixel and be able to render text with textured quads.
GPU Architecture
A GPU is a massively parallel processor optimized for throughput over latency. Where a CPU has 8-32 cores running at 3-5 GHz with deep pipelines and branch prediction, a GPU has thousands of cores running at 1-2 GHz with no branch prediction and hardware-managed context switching to hide memory latency.
graph TD
A[CPU] -->|command buffer| B[GPU Command Processor]
B --> C[Vertex Shader]
C --> D[Rasterizer]
D --> E[Fragment Shader]
E --> F[Framebuffer]
B --> G[Compute Shader]
G --> H[GPU Memory]
The SIMT Model (Single Instruction, Multiple Threads)
GPU threads execute in lockstep groups called warps (NVIDIA, 32 threads), wavefronts (AMD, 32 or 64 threads depending on the architecture), or SIMD groups (Apple, 32 threads). All threads in a warp execute the same instruction simultaneously. If threads diverge (take different branches), the warp executes both paths sequentially, masking out inactive threads. This is why GPU code should minimize divergent branching: when the threads of a warp split across an if/else, the warp pays for both sides. A branch that every thread in the warp takes the same way adds no such cost.
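The cost of divergence can be illustrated with a toy model. The sketch below is not real GPU code; it assumes a 32-thread warp, a bitmask with one bit per thread, and made-up instruction counts for each side of the branch. The name `warp_branch_cost` is hypothetical.

```c
#include <stdint.h>

/* Toy cost model for one 32-thread warp executing an if/else.
   Bit i of take_then is set when thread i takes the "then" side.
   then_cost and else_cost are illustrative instruction counts. */
unsigned warp_branch_cost(uint32_t take_then, unsigned then_cost, unsigned else_cost)
{
    unsigned cost = 0;
    if (take_then != 0)            /* at least one thread runs the "then" path */
        cost += then_cost;
    if (take_then != 0xFFFFFFFFu)  /* at least one thread runs the "else" path */
        cost += else_cost;
    return cost;                   /* both terms are added only when the warp diverges */
}
```

With a uniform mask the warp pays for one side only. A single dissenting thread is enough to make the whole warp pay for both.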
Memory hierarchy
| Level | Size | Latency | Shared across |
|---|---|---|---|
| Registers | 256 KB per SM | 0 cycles | Single thread |
| Shared memory | 48-164 KB per SM | ~20 cycles | Thread block (programmer managed) |
| L1 cache | 128 KB per SM | ~30 cycles | SM |
| L2 cache | 2-6 MB | ~200 cycles | Entire GPU |
| VRAM (HBM/GDDR) | 8-80 GB | ~400-800 cycles | Entire GPU |
| System RAM (via PCIe) | 32-256 GB | ~10,000 cycles | GPU + CPU |
Terminal rendering is L1/L2 cache friendly. The terminal grid is small (typically < 1 MB for a 200×80 grid). Glyph atlas textures are fixed-size and reused every frame. Uniform buffers (projection matrices, colors) are small enough to stay in the constant cache. The entire working set of a terminal renderer typically fits in GPU L2 cache, making it fundamentally bandwidth-light.
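As a rough check on the grid size claim, the sketch below computes the footprint of a grid for a hypothetical per-cell record. The `Cell` layout and the `grid_bytes` name are assumptions for illustration, not Ghostty's actual cell format.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-cell record: 20 bytes on typical ABIs. */
typedef struct {
    uint32_t codepoint;         /* Unicode scalar value */
    uint32_t fg_rgba;           /* packed foreground color */
    uint32_t bg_rgba;           /* packed background color */
    uint16_t atlas_x, atlas_y;  /* glyph position in the atlas, in pixels */
    uint16_t flags;             /* bold, italic, underline, ... */
    uint16_t pad;               /* explicit padding */
} Cell;

/* Total bytes needed for a cols x rows grid of Cell records. */
size_t grid_bytes(size_t cols, size_t rows)
{
    return cols * rows * sizeof(Cell);
}
```

Under these assumptions a 200×80 grid is 16,000 cells, or about 320 KB, which is comfortably below the 1 MB figure quoted above.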
The Rendering Pipeline
1. Vertex Shader
Runs once per vertex. Input: vertex attributes (position, UV coordinates). Output: transformed position in clip space. Each glyph is a quad (4 vertices, 2 triangles). The vertex shader for a terminal renderer is trivial — it transforms a unit quad to screen position:
vertex VertexOut vertex_main(VertexIn in [[stage_in]],
                             constant Uniforms &u [[buffer(0)]]) {
    float2 pos = u.projection * in.position;
    return { float4(pos, 0, 1), in.uv };
}
2. Rasterizer
Hardware stage that converts triangles to fragments (candidate pixels). Interpolates vertex outputs (UV coordinates, colors) across the triangle face. Each fragment is a potential pixel.
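The interpolation step can be sketched on the CPU with barycentric weights. This is a minimal illustration of the idea, not how any particular GPU implements it; `Vec2`, `edge`, and `interpolate_uv` are hypothetical names, and perspective correction is ignored because terminal quads are flat 2D geometry.

```c
typedef struct { float x, y; } Vec2;

/* Twice the signed area of triangle (a, b, p). */
static float edge(Vec2 a, Vec2 b, Vec2 p)
{
    return (b.x - a.x) * (p.y - a.y) - (b.y - a.y) * (p.x - a.x);
}

/* If p lies inside triangle (p0, p1, p2), write the interpolated UV
   to *out and return 1. Return 0 for points outside the triangle
   or for a degenerate (zero-area) triangle. */
int interpolate_uv(Vec2 p0, Vec2 p1, Vec2 p2,
                   Vec2 uv0, Vec2 uv1, Vec2 uv2,
                   Vec2 p, Vec2 *out)
{
    float area = edge(p0, p1, p2);
    if (area == 0.0f)
        return 0;
    /* Barycentric weights: each is 1 at its own vertex, 0 on the opposite edge. */
    float w0 = edge(p1, p2, p) / area;
    float w1 = edge(p2, p0, p) / area;
    float w2 = edge(p0, p1, p) / area;
    if (w0 < 0.0f || w1 < 0.0f || w2 < 0.0f)
        return 0;
    out->x = w0 * uv0.x + w1 * uv1.x + w2 * uv2.x;
    out->y = w0 * uv0.y + w1 * uv1.y + w2 * uv2.y;
    return 1;
}
```

Each covered pixel gets its own weighted mix of the three vertex UVs, which is what lets the fragment shader sample a different atlas texel per pixel.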
3. Fragment Shader
Runs once per fragment. Input: interpolated vertex outputs. Output: pixel color. For text rendering, the fragment shader samples the glyph atlas texture and applies the foreground color:
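fragment float4 fragment_main(VertexOut in [[stage_in]],
                              texture2d<float> glyph_atlas [[texture(0)]],
                              constant float4 &fg_color [[buffer(0)]]) {
    // fg_color is bound as a constant here for illustration;
    // a real renderer would usually carry it per cell.
    constexpr sampler atlas_sampler(filter::linear);
    float alpha = glyph_atlas.sample(atlas_sampler, in.uv).r;
    return float4(fg_color.rgb, fg_color.a * alpha);
}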
The glyph atlas stores glyphs as single-channel (alpha) textures. The fragment shader uses the alpha channel as a mask: alpha=1 → foreground color, alpha=0 → transparent (background shows through). This is why text can have any color without re-rasterizing the glyph — the glyph shape is a mask.
4. Blending
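The final stage combines the fragment output with whatever is already in the framebuffer. Alpha blending: result = src * src_alpha + dst * (1 - src_alpha). This enables antialiased glyph edges and transparency. Subpixel antialiasing additionally needs a separate coverage value per color channel, which a single src_alpha cannot express.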
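The mask and blend steps can be written out on the CPU to make the arithmetic concrete. This is a sketch under assumptions: colors are non-premultiplied floats in [0, 1], and the output alpha uses the common "source over" rule, which the text above does not specify. `Rgba`, `shade_text`, and `blend_over` are hypothetical names.

```c
typedef struct { float r, g, b, a; } Rgba;

/* Fragment stage: use the glyph coverage sample as a mask on the foreground. */
Rgba shade_text(Rgba fg, float coverage)
{
    fg.a *= coverage;
    return fg;
}

/* Blend stage: result = src * src_alpha + dst * (1 - src_alpha). */
Rgba blend_over(Rgba src, Rgba dst)
{
    float inv = 1.0f - src.a;
    Rgba out;
    out.r = src.r * src.a + dst.r * inv;
    out.g = src.g * src.a + dst.g * inv;
    out.b = src.b * src.a + dst.b * inv;
    out.a = src.a + dst.a * inv;  /* assumed "source over" alpha rule */
    return out;
}
```

For white text on a black background, a coverage of 0.5 at a glyph edge yields mid-gray, which is what produces the smooth antialiased outline.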
Texture Atlases
A texture atlas packs many small textures (glyphs) into one large texture. This avoids texture switching between draw calls — the GPU binds the atlas once and samples different regions for each glyph.
flowchart TD
subgraph Atlas["Glyph Texture Atlas"]
R1["A · B · C · D · E · F · G · H · I"]
R2["J · K · L · M · N · O · P · Q · R"]
R3["S · T · U · V · W · X · Y · Z · a"]
end
Each cell in the terminal grid references a UV rectangle in the atlas: {(u0,v0), (u1,v1)}. The vertex shader positions the quad on screen. The fragment shader samples the atlas at the glyph's UV coordinates.
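Converting a glyph's pixel rectangle in the atlas to a UV rectangle is a simple normalization. The sketch below assumes a square atlas and UV coordinates in [0, 1]; `UvRect` and `atlas_uv` are hypothetical names.

```c
typedef struct { float u0, v0, u1, v1; } UvRect;

/* Map a glyph's pixel rectangle (x, y, w, h) inside a square atlas of
   atlas_size x atlas_size pixels to normalized UV coordinates. */
UvRect atlas_uv(unsigned x, unsigned y, unsigned w, unsigned h, unsigned atlas_size)
{
    float inv = 1.0f / (float)atlas_size;
    UvRect r;
    r.u0 = (float)x * inv;
    r.v0 = (float)y * inv;
    r.u1 = (float)(x + w) * inv;
    r.v1 = (float)(y + h) * inv;
    return r;
}
```

The renderer would compute this once when the glyph is added to the atlas and store the result alongside the glyph's cache entry.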
The atlas is populated lazily: when the renderer encounters a glyph not yet in the atlas, it calls FreeType to rasterize it, uploads the bitmap to the atlas texture, and records the UV coordinates. Subsequent uses of the same glyph are just texture lookups — no re-rasterization.
Atlas packing is the critical path for startup performance. The first time you open a file with many unique glyphs (CJK text, emoji, or a font with extensive ligatures), the atlas must be populated. Smart atlases use bin-packing algorithms (like Skyline) to minimize wasted space and avoid atlas growth.
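To make the packing idea concrete, here is a minimal shelf packer. Note that this is a simpler technique than Skyline, swapped in for brevity: it fills the atlas in horizontal rows and never reuses the gap above short glyphs. `ShelfPacker`, `packer_init`, and `packer_alloc` are hypothetical names.

```c
typedef struct {
    unsigned size;     /* atlas is size x size pixels */
    unsigned shelf_x;  /* next free x on the current shelf */
    unsigned shelf_y;  /* top edge of the current shelf */
    unsigned shelf_h;  /* height of the tallest glyph on the current shelf */
} ShelfPacker;

void packer_init(ShelfPacker *p, unsigned size)
{
    p->size = size;
    p->shelf_x = 0;
    p->shelf_y = 0;
    p->shelf_h = 0;
}

/* Reserve a w x h rectangle. On success write its top-left corner to
   (*out_x, *out_y) and return 1. Return 0 if the rectangle does not fit,
   in which case the caller would grow the atlas or evict glyphs. */
int packer_alloc(ShelfPacker *p, unsigned w, unsigned h,
                 unsigned *out_x, unsigned *out_y)
{
    if (w == 0 || h == 0 || w > p->size || h > p->size)
        return 0;
    if (p->shelf_x + w > p->size) {  /* current shelf is full: open a new one */
        p->shelf_y += p->shelf_h;
        p->shelf_x = 0;
        p->shelf_h = 0;
    }
    if (p->shelf_y + h > p->size)    /* no vertical room left */
        return 0;
    *out_x = p->shelf_x;
    *out_y = p->shelf_y;
    p->shelf_x += w;
    if (h > p->shelf_h)
        p->shelf_h = h;
    return 1;
}
```

For a monospace terminal font most glyphs share one cell size, so even this naive scheme wastes little space. Skyline-style packers pay off when glyph sizes vary widely, as with emoji or mixed fonts.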
FreeType: Font Rasterization
FreeType converts font outlines (quadratic/cubic Bezier curves) into pixel bitmaps at a specific size:
FT_Library library;
FT_Face face;
FT_Init_FreeType(&library);
FT_New_Face(library, "font.ttf", 0, &face);
FT_Set_Pixel_Sizes(face, 0, 14); // 14px height
FT_Load_Char(face, 'A', FT_LOAD_RENDER);
// face->glyph->bitmap now contains the rendered glyph
// face->glyph->bitmap_left, bitmap_top: offset from pen position
// face->glyph->advance.x: how far to advance the pen
The bitmap is 8-bit grayscale (alpha). Subpixel antialiasing (ClearType on Windows, RGB subpixel rendering on Linux) renders at 3x horizontal resolution and uses the LCD's RGB subpixel structure for sharper text. Ghostty supports both grayscale and subpixel antialiasing.
HarfBuzz: Text Shaping
FreeType rasterizes individual glyphs. HarfBuzz decides which glyphs. Text shaping is non-trivial:
- Ligatures: "fi" → single ligature glyph (the 'f' and 'i' merge visually)
- Cursive connections: Arabic letters connect differently depending on position (isolated, initial, medial, final)
- Grapheme clusters: "é" can be encoded as two Unicode codepoints (e + combining acute accent) but is one visual glyph
- Bidirectional text: Arabic (RTL) mixed with English (LTR) requires segment analysis
- Indic scripts: Devanagari (used for Hindi) reorders characters: the short 'i' (ि) appears BEFORE the consonant visually but AFTER it in the Unicode string
hb_buffer_t *buf = hb_buffer_create();
hb_buffer_add_utf8(buf, "Hello, 世界!", -1, 0, -1);
hb_buffer_guess_segment_properties(buf);  // infer script, language, direction
hb_shape(font, buf, NULL, 0);             // font: an hb_font_t * created elsewhere
// buf now contains positioned glyphs:
// glyph_id, x_advance, y_advance, x_offset, y_offset, cluster
// (cluster = byte offset of the source character in the UTF-8 input)
The cluster value is critical for terminal cursor positioning. The terminal grid stores characters, not glyphs. When the user presses the right arrow key, the cursor moves by one character (Unicode grapheme cluster), not by one glyph. HarfBuzz's cluster field maps each glyph back to its position in the original text, enabling correct cursor navigation across ligatures, combining marks, and multi-byte sequences.
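A sketch of how a renderer might use cluster values to find the glyph under the cursor. It assumes a left-to-right run whose cluster values are non-decreasing byte offsets, copied out of the array returned by `hb_buffer_get_glyph_infos`. The function name `glyph_for_offset` is hypothetical.

```c
#include <stddef.h>

/* Return the index of the first glyph of the cluster that covers
   byte_offset. clusters[] holds one cluster value per glyph, in visual
   order, and must be non-decreasing (left-to-right run). n must be > 0. */
size_t glyph_for_offset(const unsigned *clusters, size_t n, unsigned byte_offset)
{
    size_t hit = 0;
    for (size_t i = 0; i < n && clusters[i] <= byte_offset; i++) {
        if (clusters[i] != clusters[hit])
            hit = i;  /* a new cluster starts at glyph i */
    }
    return hit;
}
```

For the text "fix" shaped with an "fi" ligature, the clusters would be {0, 2}: byte offsets 0 and 1 both map to the ligature glyph, and offset 2 maps to "x". For "e" plus a combining accent followed by "x", clusters of {0, 0, 3} map every byte of the accented letter to the base glyph.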
The Terminal Renderer Architecture
flowchart TD
subgraph CPU["CPU — Main Thread"]
PTY["PTY Read"] --> VT["VT Parse"]
VT --> Grid["Terminal Grid<br/>(dirty cells set)"]
Grid --> Prep["Render Preparation"]
Prep -->|"HarfBuzz shape"| Atlas["Lookup/Rasterize<br/>Glyph Atlas"]
Prep -->|"write vertex + instance data"| Buffers["Vertex Buffer<br/>Instance Buffer"]
end
subgraph GPU["GPU"]
CB["Command Buffer"]
VS["Vertex Shader<br/>position quad"]
FS["Fragment Shader<br/>sample atlas, apply color"]
FB["Framebuffer<br/>blend + present (vsync)"]
CB -->|"bind atlas + buffers"| VS
VS --> FS
FS --> FB
end
Buffers -->|"upload"| CB
Atlas -->|"upload"| CB
CB -->|"draw call<br/>N instances"| VS
Why instanced rendering
Each cell is the same quad geometry with different parameters (glyph UV, colors). Instanced rendering sends the quad geometry once and the per-cell data as an instance buffer. One draw call renders the entire terminal grid. This is vastly more efficient than one draw call per cell.
// Vertex shader with instancing
vertex VertexOut vertex_main(VertexIn in [[stage_in]],
                             const device InstanceData *inst [[buffer(1)]],
                             uint instance_id [[instance_id]]) {
    InstanceData cell = inst[instance_id];
    // Position the quad at cell.grid_x, cell.grid_y
    // Output cell.glyph_uv for the fragment shader
}
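On the CPU side, the instance buffer is just an array with one record per visible glyph. The sketch below builds it from a grid. All names and layouts (`GridCell`, `InstanceData`, `build_instances`) are assumptions for illustration: the atlas is treated as a uniform grid of slots, slot 0 marks a blank cell, and blank cells are skipped, which ignores background colors for simplicity.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t glyph_slot;  /* 0 = blank cell, k >= 1 = k-th atlas slot */
    uint32_t fg_rgba;     /* packed foreground color */
} GridCell;

typedef struct {
    uint16_t grid_x, grid_y;  /* cell position in the terminal grid */
    float    u0, v0, u1, v1;  /* glyph rectangle in the atlas */
    uint32_t fg_rgba;
} InstanceData;

/* Fill out[] with one instance per non-blank cell and return the count,
   which becomes the instance count of the single draw call.
   out must have room for cols * rows entries. */
size_t build_instances(const GridCell *cells, unsigned cols, unsigned rows,
                       unsigned slots_per_row, float slot_uv, InstanceData *out)
{
    size_t n = 0;
    for (unsigned y = 0; y < rows; y++) {
        for (unsigned x = 0; x < cols; x++) {
            const GridCell *c = &cells[(size_t)y * cols + x];
            if (c->glyph_slot == 0)
                continue;  /* nothing to draw for a blank cell */
            unsigned slot = c->glyph_slot - 1;
            InstanceData *inst = &out[n++];
            inst->grid_x = (uint16_t)x;
            inst->grid_y = (uint16_t)y;
            inst->u0 = (float)(slot % slots_per_row) * slot_uv;
            inst->v0 = (float)(slot / slots_per_row) * slot_uv;
            inst->u1 = inst->u0 + slot_uv;
            inst->v1 = inst->v0 + slot_uv;
            inst->fg_rgba = c->fg_rgba;
        }
    }
    return n;
}
```

The array is uploaded once per frame and indexed by instance ID in the vertex shader, so the per-cell CPU cost is a few stores rather than a draw call.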
Project: render-demo
Render "Hello, 世界!" with textured quads:
- Initialize a GPU context (Metal on macOS, Vulkan/OpenGL on Linux)
- Load a font with FreeType
- Rasterize each glyph and pack into a texture atlas
- Shape the text with HarfBuzz to get glyph positions
- Create vertex buffers for quads (one quad per glyph)
- Write vertex and fragment shaders (textured quad rendering)
- Issue one instanced draw call
- Verify: "Hello, 世界!" appears correctly on screen
Ghostty Source to Study
| File | What to study |
|---|---|
| src/renderer/Metal.zig | Metal backend: shader compilation, buffer management, draw call dispatch |
| src/renderer/OpenGL.zig | OpenGL backend for Linux/Windows |
| src/font/ | Font discovery (CoreText on macOS, fontconfig on Linux), face management, atlas packing |
| src/font/shaper.zig | HarfBuzz integration: text shaping, cluster mapping, glyph positioning |
Bridge to vLLM CUDA Kernels
A GPU running a CUDA kernel follows the same architecture as a GPU running a vertex shader:
- Data in GPU memory: terminal glyph atlas ↔ vLLM KV-cache tensors
- Kernel launch: Metal draw call ↔ CUDA kernel launch (<<<grid, block>>>)
- Parallelism model: one GPU thread per vertex ↔ one GPU thread per attention head element
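- Memory hierarchy: the same levels apply to both: registers and shared memory are fast and small, global memory is slow and large, and register spilling falls back to slow memory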
- Performance optimization: minimize global memory accesses, maximize shared memory reuse, avoid warp divergence
The key difference: rendering kernels write to a framebuffer (visual). CUDA kernels write to tensors (numerical). The GPU doesn't care — it's all just parallel threads reading memory, doing math, and writing memory.
Self-Check
Can you:
- Explain why a GPU has thousands of cores but each is slower than a CPU core (throughput vs latency)
- Draw the GPU pipeline: vertex shader → rasterizer → fragment shader → framebuffer
- Explain why texture atlases avoid texture switching overhead
- Explain what HarfBuzz does that FreeType doesn't (shaping vs rasterization)
- Explain why instanced rendering is critical for terminal performance (one draw call, not N)
- Trace a glyph from Unicode codepoint to pixel on screen: HarfBuzz shape → FreeType rasterize → atlas upload → vertex shader position → fragment shader sample → blend
- Explain why subpixel antialiasing requires 3x horizontal resolution and knowledge of the LCD subpixel layout