Field guide · v2026.05 · Direction A · Rim-lit hovering city

The 2026 AI inference stack, one click deep.

Eight layers between a question and an answer. From the chat box on your screen all the way down to a GPU in a datacenter. Pick a layer to see what lives there, why it exists, and what matters in 2026.


Layer 06 of 8

LLM (the model)

Weights · params · architecture

The brain. A giant pile of learned patterns that, given the conversation so far, picks the next word (strictly, the next token: a word or a fragment of one). Different models trade off four things: how smart, how fast, how cheap, and how small they are. In 2026, the frontier (Claude Opus 4.7, GPT-5, Gemini 3) is genuinely good at multi-step agentic work and writes code at a level most engineers find useful. One class below (Sonnet 4.6, GPT-5 mini, Gemini 3 Flash) is where most production traffic lives, because those models are roughly 5x cheaper and only modestly less capable. Open-weight options (Llama 4, Qwen, DeepSeek, Mistral) are closing the gap fast and can run on your own hardware. The smallest, phone-sized models do classification, routing, and basic generation locally, with no network round-trip.
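To make "picks the next word" concrete, here is a minimal sketch of the generation loop in Python. Everything in it is hypothetical: the `TOY_LOGITS` table stands in for the learned weights, and the names `softmax`, `pick_next`, and `generate` are illustrative, not any vendor's API. A real model scores tens of thousands of tokens using the whole conversation; this toy looks only at the last word.

```python
import math
import random

# Hypothetical stand-in for learned weights: for each last word,
# raw scores (logits) for the candidate next words.
TOY_LOGITS = {
    "the": {"cat": 2.0, "dog": 1.5, "answer": 0.5},
    "cat": {"sat": 2.5, "ran": 1.0},
    "dog": {"ran": 2.0, "sat": 1.0},
    "sat": {"<end>": 3.0},
    "ran": {"<end>": 3.0},
    "answer": {"<end>": 3.0},
}


def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    scaled = {word: score / temperature for word, score in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {word: math.exp(score - peak) for word, score in scaled.items()}
    total = sum(exps.values())
    return {word: value / total for word, value in exps.items()}


def pick_next(context, rng, temperature=1.0):
    """Sample one next word. The toy conditions on the last word only."""
    probs = softmax(TOY_LOGITS[context[-1]], temperature)
    words = list(probs)
    weights = [probs[word] for word in words]
    return rng.choices(words, weights=weights, k=1)[0]


def generate(prompt, max_words=10, seed=0, temperature=1.0):
    """Append one sampled word at a time until <end> or the length cap."""
    rng = random.Random(seed)
    context = list(prompt)
    for _ in range(max_words):
        word = pick_next(context, rng, temperature)
        if word == "<end>":
            break
        context.append(word)
    return context


if __name__ == "__main__":
    print(" ".join(generate(["the"], temperature=0.01)))
```

At a very low temperature the loop almost always takes the highest-scoring word at each step; raising the temperature spreads the choice across lower-scoring candidates. The same loop shape (score, pick, append, repeat) is what the layers below this one spend their GPU time on.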

Examples in the wild

Claude Sonnet 4.6 · GPT-5 · Gemini 3 · Llama 4 (local)

Connects to

↑ Inference API · ↑ Context window · ↓ Inference engine · ↔ Fine-tuning
