End-to-End LLM Stack Trace: From Keystroke to Streamed Token

A software engineer has published a detailed technical document that traces what happens at each layer of the stack when you send a prompt to an LLM like Claude or ChatGPT. Inspired by the classic "what-happens-when" repository for browser navigation, the document takes a production-systems perspective on LLM chat interactions.
What the Document Covers
The document follows the full journey in production order:
- Client-side: Live token counting via WASM tokenizers, IME composition events, optimistic UI rendering
- Network: Why SSE wins over WebSockets for chat, the UTF-8 boundary problem in streaming
- API Gateway: Edge TLS termination, multi-dimensional rate limiting (RPM vs ITPM vs OTPM)
- Safety classifiers: What runs before and after the model, why prompt injection is structurally unsolved
- Context assembly: What actually goes into the context window (it's not just your messages)
- Tokenization: Why models can't count letters, why leading spaces matter, how special tokens consume budget
- KV cache and prefix caching: GQA vs MHA memory math, PagedAttention, cache hit rate as a cost lever
- Prefill vs decode: Why they're bottlenecked differently (compute vs memory bandwidth)
- Sampling pipeline: The full logit pipeline in order: repetition penalty, temperature, top-k, top-p, softmax, sample
- Streaming: TTFT breakdown, SSE event parsing, incremental markdown rendering
- Tool use and agentic loops: Parallel tool calls, prompt injection resurfacing in tool results
- Billing and observability: TTFT vs TPOT, cache pricing math, what to instrument
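To make the sampling-pipeline item concrete, here is a minimal sketch of the stage order listed above (repetition penalty, temperature, top-k, top-p, softmax, sample). It is not taken from the document itself: the function name `sample_next_token`, the default parameter values, and the divide-or-multiply form of the repetition penalty are illustrative assumptions, and production inference servers run these steps as batched tensor operations on the GPU rather than in Python lists.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax; logits set to -inf get probability 0."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_next_token(logits, prev_tokens, *, repetition_penalty=1.1,
                      temperature=0.8, top_k=50, top_p=0.95, rng=random):
    """Hypothetical helper: pick one token id from raw logits.

    Assumes temperature > 0 and that prev_tokens holds valid token ids.
    All default values are illustrative, not taken from the source.
    """
    logits = list(logits)

    # 1. Repetition penalty (assumed form): push already-seen tokens down.
    #    Positive logits are divided, negative ones multiplied, so the
    #    penalty always lowers the score when repetition_penalty > 1.
    for t in set(prev_tokens):
        if logits[t] > 0:
            logits[t] /= repetition_penalty
        else:
            logits[t] *= repetition_penalty

    # 2. Temperature: values below 1 sharpen the distribution, above 1 flatten it.
    logits = [x / temperature for x in logits]

    # 3. Top-k: mask everything below the k-th highest logit.
    #    Ties at the cutoff are all kept; top_k <= 0 disables the filter.
    if 0 < top_k < len(logits):
        cutoff = sorted(logits, reverse=True)[top_k - 1]
        logits = [x if x >= cutoff else -math.inf for x in logits]

    # 4. Top-p (nucleus): keep the smallest set of tokens whose cumulative
    #    probability reaches top_p. This needs probabilities, so a softmax
    #    is computed here just to decide which tokens to mask.
    probs = softmax(logits)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = set(), 0.0
    for i in order:
        kept.add(i)
        mass += probs[i]
        if mass >= top_p:
            break
    logits = [x if i in kept else -math.inf for i, x in enumerate(logits)]

    # 5. Final softmax over the surviving tokens, then 6. sample one id.
    probs = softmax(logits)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


if __name__ == "__main__":
    toy_logits = [2.0, 1.0, 0.5, -1.0]  # made-up 4-token vocabulary
    print(sample_next_token(toy_logits, prev_tokens=[0]))
```

The order matters because each filter sees the output of the previous one: temperature changes how much probability mass the top tokens hold, which in turn changes how many tokens survive the top-p cut.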
Document Details
The document is aimed at engineers who already understand transformers and want to see how production systems actually work. It is released under a CC0 license, and contributions are welcome. At the bottom, the author lists several subsystems the document does not yet cover, including speculative decoding, multimodal systems, and multi-agent coordination.
The repository was created to address the gap between high-level "transformers are magic" explanations and academic papers that don't connect concepts to production system behavior.
📖 Read the full source: r/LocalLLaMA