Lesson 7: GPU & Rendering Pipeline

Jun 26, 2026 9 min read

The renderer is the hardest performance problem in a terminal emulator. A full-screen cat of a large file can generate millions of cell updates per second. Each cell is a glyph that must be shaped, rasterized, and drawn on the GPU, all within a 16.6 ms frame budget at 60 Hz. After this lesson, you will understand the GPU pipeline from draw call to pixel and be able to render text with textured quads.

GPU Architecture

A GPU is a massively parallel processor optimized for throughput over latency. Where a CPU has 8-32 cores running at 3-5 GHz with deep pipelines and branch prediction, a GPU has thousands of cores running at 1-2 GHz with no branch prediction and hardware-managed context switching to hide memory latency.

graph TD
    A[CPU] -->|command buffer| B[GPU Command Processor]
    B --> C[Vertex Shader]
    C --> D[Rasterizer]
    D --> E[Fragment Shader]
    E --> F[Framebuffer]
    B --> G[Compute Shader]
    G --> H[GPU Memory]

The SIMT Model (Single Instruction, Multiple Threads)

GPU threads execute in groups called warps (NVIDIA, 32 threads), wavefronts (AMD, 64 threads on GCN and 32 or 64 on RDNA), or SIMD groups (Apple, 32 threads). All threads in a warp execute the same instruction simultaneously. If threads diverge (take different branches), the warp executes both paths sequentially, masking out inactive threads. This is why GPU code should minimize divergent branching: when threads in one warp disagree on an if/else, the warp pays for both sides, which can roughly double its execution time. A branch that every thread in the warp takes the same way costs nothing extra.
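The divergence cost can be made concrete with a toy model. The sketch below treats a warp as a 32-bit mask of which threads take the "if" side and charges the full cost of every side that at least one thread needs. The function name and the cycle counts are illustrative assumptions, not a description of how any real GPU schedules work.

```c
#include <stdint.h>

/* Toy cost model for one 32-thread warp executing an if/else.
 * taken_mask has bit i set when thread i takes the "if" side.
 * A side costs its full cycle count if at least one thread needs it. */
unsigned warp_branch_cycles(uint32_t taken_mask,
                            unsigned if_cycles,
                            unsigned else_cycles) {
    unsigned cycles = 0;
    if (taken_mask != 0u)           /* some thread runs the if side */
        cycles += if_cycles;
    if (taken_mask != 0xFFFFFFFFu)  /* some thread runs the else side */
        cycles += else_cycles;
    return cycles;
}
```

With 10 cycles per side, a uniform warp (mask all ones or all zeros) costs 10 cycles, while a warp with even a single dissenting thread costs 20.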

Memory hierarchy

Level	Size	Latency	Shared across
Registers	256 KB per SM	0 cycles	Single thread
Shared memory	48-164 KB per SM	~20 cycles	Thread block (programmer managed)
L1 cache	128 KB per SM	~30 cycles	SM
L2 cache	2-6 MB	~200 cycles	Entire GPU
VRAM (HBM/GDDR)	8-80 GB	~400-800 cycles	Entire GPU
System RAM (via PCIe)	32-256 GB	~10,000 cycles	GPU + CPU

Terminal rendering is cache friendly. The terminal grid is small (typically < 1 MB for a 200×80 grid). Glyph atlas textures change rarely and are reused every frame. Uniform data (projection matrix, colors) is only a few hundred bytes and is served from constant caches. The working set of a typical terminal renderer therefore fits in GPU L2 cache, making it fundamentally bandwidth-light.
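A quick back-of-the-envelope calculation shows why the working set is small. The sketch below assumes 32 bytes of per-cell data and a 1024×1024 single-channel atlas. Both numbers are illustrative assumptions rather than values taken from any particular renderer.

```c
#include <stddef.h>

/* Bytes of per-cell data for a cols x rows grid. */
size_t grid_instance_bytes(size_t cols, size_t rows, size_t bytes_per_cell) {
    return cols * rows * bytes_per_cell;
}

/* Bytes for a square single-channel (8 bits per pixel) glyph atlas. */
size_t atlas_bytes(size_t side_px) {
    return side_px * side_px;
}
```

For a 200×80 grid this gives 512,000 bytes of cell data plus 1,048,576 bytes of atlas, about 1.5 MB in total, which is below the 2-6 MB L2 size listed in the table above.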

The Rendering Pipeline

1. Vertex Shader

Runs once per vertex. Input: vertex attributes (position, UV coordinates). Output: transformed position in clip space. Each glyph is a quad (4 vertices, 2 triangles). The vertex shader for a terminal renderer is trivial — it transforms a unit quad to screen position:

vertex VertexOut vertex_main(VertexIn in [[stage_in]],
                             constant Uniforms &u [[buffer(0)]]) {
    // Assumes u.projection is a float4x4 and in.position is a float2
    float4 pos = u.projection * float4(in.position, 0.0, 1.0);
    return { pos, in.uv };
}
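For 2D text the projection is usually a plain orthographic mapping from pixels to clip space. The C sketch below spells out that arithmetic on the CPU, assuming pixel coordinates with the origin at the top-left and y pointing down. The type and function names are hypothetical.

```c
typedef struct { float x, y; } Vec2;

/* Map a pixel position (origin top-left, y down) to clip space
 * (x and y in [-1, 1], y up). */
Vec2 pixel_to_clip(Vec2 px, float screen_w, float screen_h) {
    Vec2 clip;
    clip.x = 2.0f * px.x / screen_w - 1.0f;
    clip.y = 1.0f - 2.0f * px.y / screen_h;
    return clip;
}
```

On an 800×600 surface the top-left pixel maps to (-1, 1), the center to (0, 0), and the bottom-right corner to (1, -1). A projection matrix encodes the same scale and offset so the GPU can apply it per vertex.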

2. Rasterizer

Hardware stage that converts triangles to fragments (candidate pixels). Interpolates vertex outputs (UV coordinates, colors) across the triangle face. Each fragment is a potential pixel.
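The interpolation the rasterizer performs can be sketched in software with barycentric weights. This is a simplified illustration with hypothetical names. It ignores perspective correction, which does not matter for flat 2D text where every vertex has w = 1.

```c
typedef struct { float x, y; } Point;

/* Twice the signed area of triangle (a, b, c). */
float edge(Point a, Point b, Point c) {
    return (b.x - a.x) * (c.y - a.y) - (b.y - a.y) * (c.x - a.x);
}

/* Interpolate a per-vertex attribute (for example one UV component)
 * at point p inside triangle (a, b, c) using barycentric weights. */
float interpolate(Point a, Point b, Point c,
                  float attr_a, float attr_b, float attr_c, Point p) {
    float area = edge(a, b, c);
    float wa = edge(b, c, p) / area;  /* weight of vertex a */
    float wb = edge(c, a, p) / area;  /* weight of vertex b */
    float wc = edge(a, b, p) / area;  /* weight of vertex c */
    return wa * attr_a + wb * attr_b + wc * attr_c;
}
```

A point halfway along the edge from a to b receives half of each vertex's attribute value, which is how a fragment in the middle of a glyph quad ends up with a UV in the middle of the glyph.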

3. Fragment Shader

Runs once per fragment. Input: interpolated vertex outputs. Output: pixel color. For text rendering, the fragment shader samples the glyph atlas texture and applies the foreground color:

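fragment float4 fragment_main(VertexOut in [[stage_in]],
                              texture2d<float> glyph_atlas [[texture(0)]],
                              constant float4 &fg_color [[buffer(0)]]) {
    // Assumption: fg_color arrives as a constant buffer here; a real
    // renderer would more likely pass it per cell through VertexOut
    constexpr sampler atlas_sampler(filter::linear);
    float alpha = glyph_atlas.sample(atlas_sampler, in.uv).r;
    return float4(fg_color.rgb, fg_color.a * alpha);
}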

The glyph atlas stores glyphs as single-channel coverage (alpha) textures, which is why the shader reads the .r component. The fragment shader uses that value as a mask: alpha=1 → foreground color, alpha=0 → transparent (background shows through). Because the glyph shape is only a mask, text can take any color without re-rasterizing the glyph.

4. Blending

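The final stage combines the fragment output with whatever is already in the framebuffer. Alpha blending: result = src * src_alpha + dst * (1 - src_alpha). This is what produces smooth antialiased glyph edges and transparency. Subpixel antialiasing goes one step further and needs a separate coverage value per color channel, which a single src_alpha cannot express (dual-source blending is one common way to handle it).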
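The blend equation is simple enough to check by hand. The sketch below applies it on the CPU to one pixel, with the atlas sample acting as coverage. The names are hypothetical, and the GPU's fixed-function blender does this work in practice.

```c
typedef struct { float r, g, b; } Rgb;

/* Source-over blend of one glyph fragment onto the framebuffer color.
 * coverage is the atlas sample in [0, 1]; fg_alpha is the text color's alpha. */
Rgb blend_glyph(Rgb fg, float fg_alpha, float coverage, Rgb dst) {
    float a = fg_alpha * coverage;  /* src_alpha produced by the fragment shader */
    Rgb out;
    out.r = fg.r * a + dst.r * (1.0f - a);
    out.g = fg.g * a + dst.g * (1.0f - a);
    out.b = fg.b * a + dst.b * (1.0f - a);
    return out;
}
```

White text at 50% coverage over a black background yields mid-gray, which is exactly the soft edge pixel you see around an antialiased glyph.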

Texture Atlases

A texture atlas packs many small textures (glyphs) into one large texture. This avoids texture switching between draw calls — the GPU binds the atlas once and samples different regions for each glyph.

flowchart TD
    subgraph Atlas["Glyph Texture Atlas"]
        R1["A · B · C · D · E · F · G · H · I"]
        R2["J · K · L · M · N · O · P · Q · R"]
        R3["S · T · U · V · W · X · Y · Z · a"]
    end

Each cell in the terminal grid references a UV rectangle in the atlas: {(u0,v0), (u1,v1)}. The vertex shader positions the quad on screen. The fragment shader samples the atlas at the glyph's UV coordinates.
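Computing the UV rectangle is a division of pixel coordinates by the atlas size. A minimal sketch, assuming the atlas records each glyph's pixel position and size (the struct and function names are hypothetical):

```c
typedef struct { float u0, v0, u1, v1; } UvRect;

/* Normalized UV rectangle for a glyph bitmap stored at pixel (x, y)
 * with size w x h inside an atlas of atlas_w x atlas_h pixels. */
UvRect glyph_uv(int x, int y, int w, int h, int atlas_w, int atlas_h) {
    UvRect r;
    r.u0 = (float)x / (float)atlas_w;
    r.v0 = (float)y / (float)atlas_h;
    r.u1 = (float)(x + w) / (float)atlas_w;
    r.v1 = (float)(y + h) / (float)atlas_h;
    return r;
}
```

A 64×128 glyph stored at (256, 128) in a 1024×512 atlas gets the rectangle {(0.25, 0.25), (0.3125, 0.5)}.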

The atlas is populated lazily: when the renderer encounters a glyph not yet in the atlas, it calls FreeType to rasterize it, uploads the bitmap to the atlas texture, and records the UV coordinates. Subsequent uses of the same glyph are just texture lookups — no re-rasterization.

Atlas population is on the critical path the first time new glyphs appear. When you first open a file with many unique glyphs (CJK text, emoji, or a font with extensive ligatures), each one must be rasterized and uploaded. Smart atlases use bin-packing algorithms (like Skyline) to minimize wasted space and delay atlas growth.
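To give a feel for atlas packing, here is a shelf packer, which is deliberately simpler than the Skyline algorithm mentioned above and wastes more space. It fills the atlas left to right in rows ("shelves") and opens a new shelf when the current one is full. All names are hypothetical, and this is a sketch of the general technique rather than the approach used by any particular terminal.

```c
#include <stdbool.h>

typedef struct {
    int width, height;  /* atlas size in pixels */
    int pen_x, pen_y;   /* top-left of the next free slot on the current shelf */
    int shelf_h;        /* height of the tallest glyph on the current shelf */
} ShelfAtlas;

/* Reserve a w x h slot. On success writes its top-left corner to
 * out_x/out_y. Returns false when the atlas is full and must grow. */
bool shelf_alloc(ShelfAtlas *a, int w, int h, int *out_x, int *out_y) {
    if (w > a->width) return false;
    if (a->pen_x + w > a->width) {  /* row is full: start a new shelf */
        a->pen_y += a->shelf_h;
        a->pen_x = 0;
        a->shelf_h = 0;
    }
    if (a->pen_y + h > a->height) return false;
    *out_x = a->pen_x;
    *out_y = a->pen_y;
    a->pen_x += w;
    if (h > a->shelf_h) a->shelf_h = h;
    return true;
}
```

A lazy atlas would call something like shelf_alloc only on a cache miss, copy the rasterized bitmap into the returned slot, and store the slot in a glyph-to-rectangle map so later lookups skip rasterization entirely.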

FreeType: Font Rasterization

FreeType converts font outlines (quadratic/cubic Bezier curves) into pixel bitmaps at a specific size:

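#include <ft2build.h>
#include FT_FREETYPE_H

int main(void) {
    FT_Library library;
    FT_Face face;
    // FreeType calls return 0 on success
    if (FT_Init_FreeType(&library)) return 1;
    if (FT_New_Face(library, "font.ttf", 0, &face)) return 1;
    FT_Set_Pixel_Sizes(face, 0, 14);  // 14px nominal (em) size, not exact glyph height
    if (FT_Load_Char(face, 'A', FT_LOAD_RENDER)) return 1;
    // face->glyph->bitmap now contains the rendered glyph
    // face->glyph->bitmap_left, bitmap_top: offset from pen position, in pixels
    // face->glyph->advance.x: pen advance in 26.6 fixed point (divide by 64 for pixels)
    FT_Done_Face(face);
    FT_Done_FreeType(library);
    return 0;
}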

The bitmap is 8-bit grayscale (alpha). Subpixel antialiasing (ClearType on Windows, RGB subpixel rendering on Linux) renders at 3x horizontal resolution and uses the LCD's RGB subpixel structure for sharper text. Ghostty supports both grayscale and subpixel antialiasing.

HarfBuzz: Text Shaping

FreeType rasterizes individual glyphs. HarfBuzz decides which glyphs to draw and where to place them. Text shaping is non-trivial:

Ligatures: "fi" → single ligature glyph (the 'f' and 'i' merge visually)
Cursive connections: Arabic letters connect differently depending on position (isolated, initial, medial, final)
Grapheme clusters: "é" can be two Unicode codepoints (e + combining acute accent) but is one visual glyph
Bidirectional text: Arabic (RTL) mixed with English (LTR) must be split into single-direction runs (via the Unicode Bidirectional Algorithm) before each run is shaped
Indic scripts: Devanagari (the script used for Hindi) reorders characters: the short 'i' (ि) appears BEFORE the consonant visually but AFTER in the Unicode string

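#include <hb.h>

hb_blob_t *blob = hb_blob_create_from_file("font.ttf");
hb_face_t *face = hb_face_create(blob, 0);
hb_font_t *font = hb_font_create(face);
hb_buffer_t *buf = hb_buffer_create();
hb_buffer_add_utf8(buf, "Hello, 世界!", -1, 0, -1);
hb_buffer_guess_segment_properties(buf);  // infer script, language, direction
hb_shape(font, buf, NULL, 0);
// buf now contains positioned glyphs:
// hb_glyph_info_t: codepoint (now the glyph ID), cluster (byte offset into the UTF-8 input)
// hb_glyph_position_t: x_advance, y_advance, x_offset, y_offset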

The cluster value is critical for terminal cursor positioning. The terminal grid stores characters, not glyphs. When the user presses the right arrow key, the cursor moves by one character (Unicode grapheme cluster), not by one glyph. HarfBuzz's cluster field maps each glyph back to its position in the input text (a byte offset when the buffer is filled with hb_buffer_add_utf8), enabling correct cursor navigation across ligatures, combining marks, and multi-byte sequences.
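The mapping from text position back to glyph can be sketched without HarfBuzz itself. The function below takes an array of cluster values, as they might come out of a shaped left-to-right run, and finds the glyph that covers a given byte offset. The function name and the sample cluster values are hand-written assumptions; real values depend on the font and on HarfBuzz's cluster level.

```c
#include <stddef.h>
#include <stdint.h>

/* Given the cluster value of each shaped glyph (ascending, as for
 * left-to-right text), find the glyph that covers byte_offset.
 * Returns glyph_count when no glyph starts at or before byte_offset. */
size_t glyph_for_offset(const uint32_t *clusters, size_t glyph_count,
                        uint32_t byte_offset) {
    size_t found = glyph_count;  /* sentinel: no glyph found */
    for (size_t i = 0; i < glyph_count; i++) {
        if (clusters[i] <= byte_offset)
            found = i;           /* this glyph starts at or before the offset */
        else
            break;               /* later glyphs start after the offset */
    }
    return found;
}
```

If "fi!" shapes to a ligature glyph plus "!", the clusters would be {0, 2}: byte offsets 0 and 1 both resolve to glyph 0, so a cursor on either 'f' or 'i' is drawn over the same ligature.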

The Terminal Renderer Architecture

flowchart TD
    subgraph CPU["CPU — Main Thread"]
        PTY["PTY Read"] --> VT["VT Parse"]
        VT --> Grid["Terminal Grid<br/>(dirty cells set)"]
        Grid --> Prep["Render Preparation"]
        Prep -->|"HarfBuzz shape"| Atlas["Lookup/Rasterize<br/>Glyph Atlas"]
        Prep -->|"write vertex + instance data"| Buffers["Vertex Buffer<br/>Instance Buffer"]
    end
    subgraph GPU["GPU"]
        CB["Command Buffer"]
        VS["Vertex Shader<br/>position quad"]
        FS["Fragment Shader<br/>sample atlas, apply color"]
        FB["Framebuffer<br/>blend + present (vsync)"]
        CB -->|"bind atlas + buffers"| VS
        VS --> FS
        FS --> FB
    end
    Buffers -->|"upload"| CB
    Atlas -->|"upload"| CB
    CB -->|"draw call<br/>N instances"| VS

Why instanced rendering

Each cell is the same quad geometry with different parameters (glyph UV, colors). Instanced rendering sends the quad geometry once and the per-cell data as an instance buffer. One draw call renders the entire terminal grid. This is vastly more efficient than one draw call per cell.

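// Vertex shader with instancing
// Assumptions: in.position and in.uv span a unit quad (0..1), u.cell_size is the
// cell size in pixels, and cell.glyph_uv is a float4 holding (u0, v0, u1, v1)
vertex VertexOut vertex_main(VertexIn in [[stage_in]],
                             const device InstanceData *inst [[buffer(1)]],
                             constant Uniforms &u [[buffer(2)]],
                             uint instance_id [[instance_id]]) {
    InstanceData cell = inst[instance_id];
    // Position the quad at cell.grid_x, cell.grid_y
    float2 px = (float2(cell.grid_x, cell.grid_y) + in.position) * u.cell_size;
    // Output cell.glyph_uv for the fragment shader
    float2 uv = mix(cell.glyph_uv.xy, cell.glyph_uv.zw, in.uv);
    return { u.projection * float4(px, 0.0, 1.0), uv };
}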
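On the CPU side, the instance buffer is just an array with one record per visible cell. The sketch below walks a grid and emits those records. The struct layouts are hypothetical and only reuse the field names from the shader comments above; a real renderer would match its own shader's layout and alignment rules.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint16_t grid_x, grid_y;  /* cell position in the grid */
    float    glyph_uv[4];     /* u0, v0, u1, v1 in the atlas */
    uint8_t  fg[4];           /* RGBA foreground color */
} InstanceData;

typedef struct {
    float   glyph_uv[4];
    uint8_t fg[4];
    int     blank;            /* non-zero: nothing to draw */
} Cell;

/* Write one InstanceData per non-blank cell. Returns the instance count,
 * which becomes the N of the single instanced draw call. */
size_t build_instances(const Cell *grid, size_t cols, size_t rows,
                       InstanceData *out) {
    size_t n = 0;
    for (size_t y = 0; y < rows; y++) {
        for (size_t x = 0; x < cols; x++) {
            const Cell *c = &grid[y * cols + x];
            if (c->blank) continue;
            out[n].grid_x = (uint16_t)x;
            out[n].grid_y = (uint16_t)y;
            for (int k = 0; k < 4; k++) {
                out[n].glyph_uv[k] = c->glyph_uv[k];
                out[n].fg[k] = c->fg[k];
            }
            n++;
        }
    }
    return n;
}
```

Skipping blank cells keeps the instance count proportional to visible text rather than grid size, and the whole array can be uploaded in one copy before the draw call.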

Project: render-demo

Render "Hello, 世界!" with textured quads:

Initialize a GPU context (Metal on macOS, Vulkan/OpenGL on Linux)
Load a font with FreeType
Rasterize each glyph and pack into a texture atlas
Shape the text with HarfBuzz to get glyph positions
Create vertex buffers for quads (one quad per glyph)
Write vertex and fragment shaders (textured quad rendering)
Issue one instanced draw call
Verify: "Hello, 世界!" appears correctly on screen
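The quad-creation step above comes down to pen arithmetic. The sketch below places one glyph quad from FreeType-style metrics (the bitmap_left, bitmap_top, and advance fields named earlier) and advances the pen. The struct and function names are hypothetical, and the screen's y axis is assumed to point down. When positions come from HarfBuzz instead, its x_advance and x_offset values would play the same role.

```c
typedef struct {
    int bitmap_left, bitmap_top;  /* offset of the bitmap from the pen, in pixels */
    int width, height;            /* bitmap size in pixels */
    int advance_26_6;             /* horizontal advance in 1/64 pixel units */
} GlyphMetrics;

typedef struct { float x0, y0, x1, y1; } Quad;  /* screen rectangle, y down */

/* Place one glyph quad at the pen position on the given baseline,
 * then advance the pen for the next glyph. */
Quad place_glyph(const GlyphMetrics *g, float *pen_x, float baseline_y) {
    Quad q;
    q.x0 = *pen_x + (float)g->bitmap_left;
    q.y0 = baseline_y - (float)g->bitmap_top;  /* bitmap_top points up from the baseline */
    q.x1 = q.x0 + (float)g->width;
    q.y1 = q.y0 + (float)g->height;
    *pen_x += (float)g->advance_26_6 / 64.0f;  /* 26.6 fixed point to pixels */
    return q;
}
```

Calling place_glyph once per shaped glyph, left to right, yields the screen rectangles that go into the vertex or instance buffer.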

Ghostty Source to Study

File	What to study
`src/renderer/Metal.zig`	Metal backend: shader compilation, buffer management, draw call dispatch
`src/renderer/OpenGL.zig`	OpenGL backend for Linux/Windows
`src/font/`	Font discovery (CoreText on macOS, fontconfig on Linux), face management, atlas packing
`src/font/shaper.zig`	HarfBuzz integration: text shaping, cluster mapping, glyph positioning

Bridge to vLLM CUDA Kernels

A GPU running a CUDA kernel follows the same architecture as a GPU running a vertex shader:

Data in GPU memory: terminal glyph atlas ↔ vLLM KV-cache tensors
Kernel launch: Metal draw call ↔ CUDA kernel launch (<<<grid, block>>>)
Parallelism model: one GPU thread per vertex ↔ one GPU thread per attention head element
Memory hierarchy: both keep hot data in fast on-chip storage (registers, shared/threadgroup memory) and pay heavily when data spills to slow global memory
Performance optimization: minimize global memory accesses, maximize shared memory reuse, avoid warp divergence

The key difference: rendering kernels write to a framebuffer (visual). CUDA kernels write to tensors (numerical). The GPU doesn't care — it's all just parallel threads reading memory, doing math, and writing memory.

Self-Check

Can you:

Explain why a GPU has thousands of cores but each is slower than a CPU core (throughput vs latency)
Draw the GPU pipeline: vertex shader → rasterizer → fragment shader → framebuffer
Explain why texture atlases avoid texture switching overhead
Explain what HarfBuzz does that FreeType doesn't (shaping vs rasterization)
Explain why instanced rendering is critical for terminal performance (one draw call, not N)
Trace a glyph from Unicode codepoint to pixel on screen: HarfBuzz shape → FreeType rasterize → atlas upload → vertex shader position → fragment shader sample → blend
Explain why subpixel antialiasing requires 3x horizontal resolution and knowledge of the LCD subpixel layout
