End-to-End LLM Stack Trace: From Keystroke to Streamed Token

A software engineer has published a detailed technical document that traces exactly what happens at every layer of the stack when you send a prompt to an LLM like Claude or ChatGPT. Inspired by the classic "what-happens-when" repository for browser navigation, this document provides a production systems perspective on LLM chat interactions.
What the Document Covers
The document follows the full journey in production order:
- Client-side: Live token counting via WASM tokenizers, IME composition events, optimistic UI rendering
- Network: Why SSE wins over WebSockets for chat, and the UTF-8 boundary problem in streaming (a multi-byte character can be split across two chunks)
- API Gateway: Edge TLS termination, multi-dimensional rate limiting (RPM vs ITPM vs OTPM)
- Safety classifiers: What runs before and after the model, why prompt injection is structurally unsolved
- Context assembly: What actually goes into the context window (it's not just your messages)
- Tokenization: Why models can't count letters, why leading spaces matter, how special tokens consume budget
- KV cache and prefix caching: GQA vs MHA memory math, PagedAttention, cache hit rate as a cost lever
- Prefill vs decode: Why they're bottlenecked differently (compute vs memory bandwidth)
- Sampling pipeline: The full logit pipeline in order: repetition penalty, temperature, top-k, top-p, softmax, sample
- Streaming: TTFT breakdown, SSE event parsing, incremental markdown rendering
- Tool use and agentic loops: Parallel tool calls, prompt injection resurfacing in tool results
- Billing and observability: TTFT vs TPOT, cache pricing math, what to instrument
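To make the sampling pipeline bullet concrete, here is a minimal pure-Python sketch of the stages in the order the document lists them. It is an illustration under stated assumptions, not the document's own code: the function name `sample_next_token`, its parameters and defaults, and the divide-or-multiply form of the repetition penalty are all assumptions made for this example. The top-p stage uses an intermediate softmax to measure cumulative probability, which is a common way to implement it, before the final softmax over the surviving tokens.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_next_token(logits, generated, rep_penalty=1.1, temperature=0.8,
                      top_k=3, top_p=0.9, rng=None):
    """Hypothetical helper: pick the next token id from raw logits.

    `logits` is indexed by token id; `generated` lists token ids already
    emitted. Temperature must be > 0 (greedy decoding is not handled here).
    """
    rng = rng or random.Random()
    logits = list(logits)

    # 1. Repetition penalty (assumed form): push already-seen tokens down
    #    by dividing positive logits and multiplying negative ones.
    for tok in set(generated):
        if logits[tok] > 0:
            logits[tok] /= rep_penalty
        else:
            logits[tok] *= rep_penalty

    # 2. Temperature: values below 1 sharpen the distribution.
    logits = [x / temperature for x in logits]

    # 3. Top-k: keep the k highest-scoring token ids, best first.
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    keep = order[:top_k]

    # 4. Top-p: keep the smallest prefix whose cumulative probability
    #    reaches top_p (probabilities taken over the top-k survivors).
    probs = softmax([logits[i] for i in keep])
    nucleus, cumulative = [], 0.0
    for tok, p in zip(keep, probs):
        nucleus.append(tok)
        cumulative += p
        if cumulative >= top_p:
            break

    # 5. Softmax over the final survivors, then 6. sample one token id.
    final_probs = softmax([logits[i] for i in nucleus])
    return rng.choices(nucleus, weights=final_probs, k=1)[0]


if __name__ == "__main__":
    toy_logits = [2.0, 1.0, 0.5, -1.0]
    print(sample_next_token(toy_logits, generated=[0], rng=random.Random(0)))
```

Setting `top_k=1` reduces the sketch to greedy selection after the penalty is applied, which is a handy way to check that each stage behaves as expected before adding randomness back in.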
Document Details
The document is aimed at engineers who already understand transformers and want to see how production systems actually work. It is released under the CC0 license, and contributions are welcome. At the bottom, the author lists several subsystems that are not yet covered, including speculative decoding, multimodal systems, and multi-agent coordination.
The repository was created to address the gap between high-level "transformers are magic" explanations and academic papers that don't connect concepts to production system behavior.
📖 Read the full source: r/LocalLLaMA