LLM Architecture Gallery: Visual Reference for Model Designs

Sebastian Raschka's LLM Architecture Gallery collects architecture figures and fact sheets from two of his articles, "The Big LLM Architecture Comparison" and "A Dream of Spring for Open-Weight LLMs", keeping only the architecture panels. Each figure can be clicked to enlarge it, and each model title links to the corresponding article section.
Key Model Details
The gallery provides specific architectural specifications for numerous models:
- Llama 3 8B: 8B parameters, released 2024-04-18, dense decoder with GQA attention and RoPE positional encoding, serves as pre-norm baseline
- OLMo 2 7B: 7B parameters, released 2024-11-25, dense decoder with MHA and QK-Norm, uses inside-residual post-norm instead of pre-norm
- DeepSeek V3: 671B total parameters (37B active), released 2024-12-26, sparse MoE decoder with MLA attention, uses dense prefix plus shared expert
- DeepSeek R1: 671B total parameters (37B active), released 2025-01-20, sparse MoE decoder with MLA attention, architecture matches DeepSeek V3 with reasoning-oriented training
- Gemma 3 27B: 27B parameters, released 2025-03-11, dense decoder with GQA and QK-Norm, uses 5:1 sliding-window/global attention ratio
- Mistral Small 3.1 24B: 24B parameters, released 2025-03-18, dense decoder with standard GQA, latency-focused design with smaller KV cache
- Llama 4 Maverick: 400B total parameters (17B active), released 2025-04-05, sparse MoE decoder with GQA attention, alternates dense and MoE blocks
- Qwen3 235B-A22B: 235B total parameters (22B active), released 2025-04-28, sparse MoE decoder with GQA and QK-Norm, optimized for serving efficiency without shared expert
- Qwen3 32B: 32B parameters, released 2025-04-28, dense decoder with GQA and QK-Norm, reference dense Qwen stack with 8 KV heads
- Qwen3 4B: 4B parameters, released 2025-04-28, dense decoder with GQA and QK-Norm, compact stack with 151k vocabulary
- Qwen3 8B: 8B parameters, released 2025-04-28, dense decoder with GQA and QK-Norm, reference Qwen3 dense stack with 8 KV heads
- SmolLM3 3B: 3B parameters, released 2025-06-19, dense decoder with GQA, experiments with periodic NoPE layers
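The total-versus-active parameter counts in the list above can be compared directly. The following is a minimal sketch, not part of the gallery itself: the `ModelSpec` class, its field names, and the `active_fraction` helper are hypothetical names chosen for illustration, and only the parameter counts and attention types are taken from the fact sheets listed above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelSpec:
    """Hypothetical container for one fact sheet (names are illustrative)."""
    name: str
    total_params_b: float                    # total parameters, in billions
    attention: str                           # attention variant, e.g. "GQA" or "MLA"
    active_params_b: Optional[float] = None  # active parameters per token; None for dense models

    def active_fraction(self) -> float:
        """Share of parameters used per token (1.0 for dense models)."""
        if self.active_params_b is None:
            return 1.0
        return self.active_params_b / self.total_params_b


# Values copied from the list above; only a subset of models is shown.
SPECS = [
    ModelSpec("Llama 3 8B", 8, "GQA"),
    ModelSpec("DeepSeek V3", 671, "MLA", 37),
    ModelSpec("Llama 4 Maverick", 400, "GQA", 17),
    ModelSpec("Qwen3 235B-A22B", 235, "GQA", 22),
]

# Sort from the sparsest model (smallest active share) to fully dense.
for spec in sorted(SPECS, key=lambda s: s.active_fraction()):
    print(f"{spec.name:<18} {spec.attention:<4} {spec.active_fraction():.3f}")
```

Under these numbers, Llama 4 Maverick activates the smallest share of its weights per token, followed by DeepSeek V3 and Qwen3 235B-A22B, while a dense model such as Llama 3 8B uses all of them.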
Practical Features
The gallery includes an issue tracker for reporting inaccurate fact sheets, mislabeled architectures, or broken links. A physical poster version is available via Zazzle with a high-resolution export at 14570 x 12490 pixels (56 MB PNG file, 182 megapixels).
For developers working with AI coding agents, the fact sheets give concrete details (attention type, total versus active parameters, normalization placement) that can inform model selection and fine-tuning decisions. The side-by-side format makes trade-offs between architectural choices easier to compare.
📖 Read the full source: HN LLM Tools