End-to-End LLM Stack Trace: From Keystroke to Streamed Token

✍️ OpenClawRadar | 📅 Published: March 19, 2026 | 🔗 Source

A software engineer has published a detailed technical document that traces what happens at each layer of the stack when you send a prompt to an LLM chat product such as Claude or ChatGPT. Inspired by the classic "what-happens-when" repository for browser navigation, it takes a production-systems perspective on LLM chat interactions.

What the Document Covers

The document follows the full journey in production order:

  • Client-side: Live token counting via WASM tokenizers, IME composition events, optimistic UI rendering
  • Network: Why SSE wins over WebSockets for chat, UTF-8 boundary problem in streaming
  • API Gateway: Edge TLS termination, multi-dimensional rate limiting (RPM vs ITPM vs OTPM)
  • Safety classifiers: What runs before and after the model, why prompt injection is structurally unsolved
  • Context assembly: What actually goes into the context window (it's not just your messages)
  • Tokenization: Why models can't count letters, why leading spaces matter, how special tokens consume budget
  • KV cache and prefix caching: GQA vs MHA memory math, PagedAttention, cache hit rate as a cost lever
  • Prefill vs decode: Why they're bottlenecked differently (compute vs memory bandwidth)
  • Sampling pipeline: The full logit pipeline in order: repetition penalty, temperature, top-k, top-p, softmax, sample
  • Streaming: TTFT breakdown, SSE event parsing, incremental markdown rendering
  • Tool use and agentic loops: Parallel tool calls, prompt injection resurfacing in tool results
  • Billing and observability: TTFT vs TPOT, cache pricing math, what to instrument
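
To make the sampling-pipeline item concrete, here is a minimal sketch of the stages in the order listed above (repetition penalty, temperature, top-k, top-p, softmax, sample). It is an illustration under stated assumptions, not code from the document: the function names, default values, and the divide-or-multiply form of the repetition penalty are hypothetical choices for this example. Note that top-p needs an intermediate softmax to measure cumulative probability before the final softmax over the surviving tokens.

```python
import math
import random


def softmax(logits):
    """Convert logits to probabilities; logits of -inf get probability 0."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_next_token(logits, prev_tokens, *, repetition_penalty=1.1,
                      temperature=0.8, top_k=50, top_p=0.95, rng=random):
    """Hypothetical helper: pick one token id from raw logits.

    Assumes temperature > 0 and that prev_tokens holds valid indices
    into logits. All defaults are illustrative, not recommendations.
    """
    logits = list(logits)

    # 1. Repetition penalty (assumed convention): shrink positive logits and
    #    push negative logits further down for tokens already generated.
    for t in set(prev_tokens):
        if logits[t] > 0:
            logits[t] /= repetition_penalty
        else:
            logits[t] *= repetition_penalty

    # 2. Temperature: values below 1 sharpen the distribution, above 1 flatten it.
    logits = [x / temperature for x in logits]

    # 3. Top-k: keep the k highest logits, mask the rest (ties may keep a few more).
    if 0 < top_k < len(logits):
        cutoff = sorted(logits, reverse=True)[top_k - 1]
        logits = [x if x >= cutoff else -math.inf for x in logits]

    # 4. Top-p: keep the smallest set of tokens whose probability mass reaches top_p.
    probs = softmax(logits)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, mass = set(), 0.0
    for i in order:
        keep.add(i)
        mass += probs[i]
        if mass >= top_p:
            break
    logits = [x if i in keep else -math.inf for i, x in enumerate(logits)]

    # 5. Softmax over the surviving logits, then 6. draw one token id.
    probs = softmax(logits)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

Setting `top_k=1` makes the sketch greedy, which is a convenient way to check each earlier stage in isolation: for example, with logits `[1.0, 3.0, 2.0]` and a penalty of 2.0 on token 1, the penalized logit drops to 1.5 and token 2 becomes the top choice.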

Document Details

The document is aimed at engineers who already understand transformers and want to see how production systems actually work. It is released under the CC0 license, and contributions are welcome. At the bottom, the author lists several subsystems that are not yet covered, including speculative decoding, multimodal systems, and multi-agent coordination.

The repository was created to address the gap between high-level "transformers are magic" explanations and academic papers that don't connect concepts to production system behavior.

📖 Read the full source: r/LocalLLaMA
