Bodega Inference Engine: Optimizing LLM Inference for Apple Silicon's Unified Memory

Bodega is an inference engine designed specifically for Apple Silicon's unified memory architecture. It was built over 2.5 years on MLX, with optimizations close to the Metal layer, and it targets the fundamental throughput limitations developers face when running LLMs on Mac hardware.
Why Apple Silicon Requires Different Optimization
Apple Silicon uses unified memory: the CPU, GPU, and Neural Engine share one physical pool over a single on-chip bus. Discrete GPUs such as NVIDIA's instead keep VRAM separate from system RAM and connect the two pools over PCIe. Memory bandwidth ranges from ~400 GB/s on the M1 Max to ~800 GB/s on the M3 Ultra, although the M3 Ultra's cross-die penalty reduces actual throughput to 1.6-1.8x single-die performance.
Key architectural implications:
- Decode is memory-bandwidth-bound - each generated token requires reading the model weights over the shared bus
- Prefill is compute-bound - dominated by GPU TFLOPS for matrix-matrix multiplication
- The memory bus is shared with everything - KV cache, model weights, OS, and applications all compete for the same 400-800 GB/s bandwidth
This architecture makes direct ports of vLLM or llama.cpp's batching implementations ineffective on MLX, as they were designed for different memory architectures.
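The bandwidth-bound decode point above can be made concrete with a rough back-of-envelope estimate. The sketch below assumes that every decoded token streams the full weights across the bus exactly once, and it ignores KV cache reads, the cross-die penalty, and other traffic on the bus. The 70B-parameter, 4-bit model is a hypothetical example, not a configuration from the source; the bandwidth figures are the approximate ones quoted above.

```python
def decode_tokens_per_second(bandwidth_gb_s: float, weight_gb: float) -> float:
    """Upper bound on single-sequence decode speed when every token
    must stream all model weights across the memory bus once."""
    if weight_gb <= 0:
        raise ValueError("weight_gb must be positive")
    return bandwidth_gb_s / weight_gb


# Hypothetical model (an assumption for illustration):
# 70B parameters at 4 bits (0.5 bytes) per weight is about 35 GB.
WEIGHT_GB = 70e9 * 0.5 / 1e9

for chip, bandwidth in [("M1 Max", 400.0), ("M3 Ultra", 800.0)]:
    ceiling = decode_tokens_per_second(bandwidth, WEIGHT_GB)
    print(f"{chip}: at most {ceiling:.1f} tokens/s per sequence")
```

Real throughput will land below this ceiling, since the OS, applications, and the KV cache draw on the same bandwidth.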
What Bodega Builds
The developer studied vLLM's core internals including continuous batching, speculative decoding, chunked prefill, and prefix caching, then rebuilt every component for MLX and Apple's unified memory model.
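Of the components listed above, continuous batching is the easiest to illustrate in isolation. The toy simulation below is a generic sketch of the scheduling idea, not Bodega's or vLLM's implementation: the function names and the request lengths are made up for illustration, and each "step" stands for one decode pass that emits one token for every active sequence.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Static batching: each batch runs until its longest sequence finishes,
    so short sequences leave their slots idle while they wait."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Continuous batching: a finished sequence frees its slot, and the next
    queued request is admitted at the following decode step."""
    queue = deque(lengths)
    active = []  # tokens still to generate for each running sequence
    steps = 0
    while queue or active:
        # Admit waiting requests into any free slots.
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        # One decode step produces one token for every active sequence.
        active = [remaining - 1 for remaining in active]
        # Drop sequences that have finished generating.
        active = [remaining for remaining in active if remaining > 0]
        steps += 1
    return steps


# Hypothetical workload: tokens to generate per request.
lengths = [8, 2, 2, 2, 8, 2, 2, 2]
print(static_batching_steps(lengths, batch_size=4))      # 16
print(continuous_batching_steps(lengths, batch_size=4))  # 10
```

In this example the same eight requests finish in 10 decode steps instead of 16, because the short requests no longer hold slots hostage until the long ones complete.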
The core insight for continuous batching: generating a single token for a single sequence streams the full model weights across the bus for one matrix-vector multiply, which wastes most of the work the hardware could do with 400+ GB/s of bandwidth. The solution runs multiple sequences simultaneously, so each pass over the weights multiplies them by a matrix of stacked vectors (one per sequence) instead of a single vector.
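The amortization argument can be sketched with simple arithmetic. The function below is an illustrative bandwidth-only model, assuming one pass over the weights serves every sequence in the batch; it ignores KV cache traffic and the batch size at which the GPU becomes compute-bound. The 8 GB weight size is a hypothetical figure, not one from the source.

```python
def aggregate_tokens_per_second(bandwidth_gb_s: float,
                                weight_gb: float,
                                batch_size: int) -> float:
    """Bandwidth-only ceiling on total decode throughput when a single
    pass over the weights yields one token for each of `batch_size` sequences."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    passes_per_second = bandwidth_gb_s / weight_gb
    return passes_per_second * batch_size


# Hypothetical 8 GB of weights on a ~400 GB/s bus.
for batch in (1, 4, 8):
    total = aggregate_tokens_per_second(400.0, 8.0, batch)
    print(f"batch={batch}: up to {total:.0f} tokens/s in aggregate")
```

Under these assumptions the weight traffic per step stays the same while the tokens produced per step grow with the batch, which is why batching raises aggregate throughput without speeding up any single sequence.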
KV cache management was redesigned for unified memory, where evicting cache blocks carries different costs than on systems with isolated VRAM.
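To see why cache management matters when weights and cache share one pool, it helps to estimate how much memory the KV cache takes. The sketch below uses the standard sizing formula for a transformer KV cache; it says nothing about how Bodega actually lays out or evicts blocks. The model shape (32 layers, 8 KV heads, head dimension 128, 16-bit values) and the workload are hypothetical.

```python
def kv_cache_bytes_per_token(num_layers: int,
                             num_kv_heads: int,
                             head_dim: int,
                             bytes_per_value: int = 2) -> int:
    """Bytes of KV cache added per token of context.
    The factor of 2 covers one key and one value vector per layer per KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def kv_cache_gib(num_sequences: int,
                 tokens_per_sequence: int,
                 bytes_per_token: int) -> float:
    """Total KV cache footprint in GiB for a batch of equal-length sequences."""
    return num_sequences * tokens_per_sequence * bytes_per_token / 2**30


# Hypothetical model shape and workload (assumptions for illustration).
per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
print(per_token)                          # 131072 bytes, i.e. 128 KiB per token
print(kv_cache_gib(8, 8192, per_token))   # 8.0 GiB for 8 sequences of 8192 tokens
```

On a unified memory machine that footprint comes out of the same pool as the model weights, the OS, and every running application.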
Practical Implications
The developer reports testing on multiple Apple Silicon configurations: two M3 Ultras (256 GB and 512 GB), an M4 Max (128 GB), and an M1 Max (64 GB). The common ceiling identified across them is single-user throughput: with one request served at a time, the GPU sits mostly idle.
The repository includes benchmarks that readers can verify themselves; setup is a simple curl script.
📖 Read the full source: r/LocalLLaMA