Comparative Overview of Fast LLM Inference by Anthropic and OpenAI

Anthropic and OpenAI have each recently introduced a "fast mode" to speed up inference on their coding models. Both modes deliver substantially higher tokens-per-second rates, but they differ greatly in approach and in what they trade away to get there.
Key Details
Anthropic's fast mode delivers up to 2.5x the output speed, raising Opus 4.6 from about 65 tokens per second to about 170. It achieves this by prioritizing low-batch-size inference: requests are served without waiting for a large batch to assemble, akin to a bus that departs immediately instead of waiting to fill up. The tradeoff is price, since fast mode costs six times as much as the standard mode. Fast mode still runs the actual Opus 4.6 model, so capability is unchanged.
OpenAI, on the other hand, takes a markedly different approach and reaches more than 1000 tokens per second, roughly 15x GPT-5.3-Codex's base rate of 65 tokens per second. It does this with a new model, GPT-5.3-Codex-Spark, which is purpose-built for speed and runs on Cerebras chips. These chips are distinguished by their size (70 square inches, compared to about one square inch for a typical H100 chip) and provide ultra-low-latency compute by fitting entire models in their substantial internal memory.
OpenAI's setup gains its speed by operating entirely in on-chip memory, which minimizes the delays of streaming data in from elsewhere, but it compromises on model capability. GPT-5.3-Codex-Spark is less capable than its vanilla counterpart, especially on more complex tasks and tool calls.
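The bus analogy can be made concrete with a toy model. The sketch below is an illustration only, not a description of how Anthropic schedules requests: it assumes requests arrive at evenly spaced intervals, that a batch departs only once it is full, and that the cost of one batch is shared equally among its requests. The function names and the 0.1-second arrival interval are hypothetical.

```python
def mean_wait_seconds(batch_size: int, arrival_interval: float) -> float:
    """Average time a request waits for its batch to fill.

    Toy assumption: requests arrive exactly `arrival_interval` seconds
    apart and the batch departs as soon as the last seat is taken.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    # The i-th arrival (0-based) waits for the remaining requests to show up.
    waits = [(batch_size - 1 - i) * arrival_interval for i in range(batch_size)]
    return sum(waits) / batch_size


def relative_cost_per_request(batch_size: int) -> float:
    """Share of one batch's cost paid by each request (toy assumption)."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return 1.0 / batch_size


if __name__ == "__main__":
    for size in (1, 8, 32):
        wait = mean_wait_seconds(size, arrival_interval=0.1)
        cost = relative_cost_per_request(size)
        print(f"batch={size:>2}  mean wait={wait:.2f}s  relative cost={cost:.3f}")
```

In this model a batch of one never waits but carries the full cost of its pass, while larger batches are cheaper per request and slower to depart. That is the same direction as the speed-for-price tradeoff described above, though real serving systems are considerably more involved.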
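To put the quoted figures side by side, here is a minimal sketch of the arithmetic. The throughput numbers and the 6x price multiplier come from the text above; the helper names and the 2,000-token response length are assumptions chosen for illustration.

```python
# Throughput figures quoted in the text, in tokens per second.
RATES = {
    "Opus 4.6 (standard)": 65,
    "Opus 4.6 (fast mode)": 170,
    "GPT-5.3-Codex (standard)": 65,
    "GPT-5.3-Codex-Spark": 1000,
}

FAST_MODE_PRICE_MULTIPLIER = 6  # "six times the cost", per the text


def generation_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Time to stream `num_tokens` at a steady output rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


def speedup(fast_rate: float, base_rate: float) -> float:
    """How many times faster `fast_rate` is than `base_rate`."""
    if base_rate <= 0:
        raise ValueError("base_rate must be positive")
    return fast_rate / base_rate


if __name__ == "__main__":
    response_tokens = 2000  # hypothetical response length
    for name, rate in RATES.items():
        seconds = generation_seconds(response_tokens, rate)
        print(f"{name}: {seconds:.1f} s for {response_tokens} tokens")

    anthropic = speedup(170, 65)
    openai = speedup(1000, 65)
    print(f"Anthropic fast mode speedup: {anthropic:.1f}x")
    print(f"Codex-Spark speedup: {openai:.1f}x")
    # Price paid per unit of speedup in Anthropic's fast mode.
    print(f"Price per 1x of speedup: {FAST_MODE_PRICE_MULTIPLIER / anthropic:.2f}x")
```

Note that this compares raw output speed only. It says nothing about the capability gap between GPT-5.3-Codex-Spark and the full model, which the numbers cannot capture.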
Who It's For
This comparison is most relevant to developers who are optimizing the performance of AI systems and need to weigh inference speed against model capability.
📖 Read the full source: HN LLM Tools