Fast LLM Inference: Anthropic vs OpenAI Comparison

Anthropic and OpenAI have each recently introduced a 'fast mode' that speeds up inference for their coding models. Both raise tokens-per-second throughput substantially, but they differ sharply in approach and in what the resulting model can do.

Key Details

Anthropic's fast mode delivers up to 2.5x the tokens per second, raising Opus 4.6 from about 65 tokens per second to about 170. The gain comes from prioritizing low-batch-size inference: like a bus that departs immediately instead of waiting to fill up, a smaller batch means each request is processed sooner. The tradeoff is price, at six times the cost. Fast mode still runs the actual Opus 4.6 model.
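The speed-versus-price tradeoff can be made concrete with a little arithmetic. The sketch below uses only the figures quoted above (65 and about 170 tokens per second, six times the cost). It assumes the 6x figure applies per token, which the text does not state; the function names are illustrative.

```python
# Figures quoted in the text above.
BASE_TPS = 65        # Opus 4.6 standard rate, tokens per second
FAST_TPS = 170       # approximate fast-mode rate, tokens per second
COST_MULTIPLIER = 6  # fast-mode price relative to standard (assumed per token)


def generation_seconds(tokens: int, tokens_per_second: float) -> float:
    """Time to stream `tokens` output tokens at a steady rate."""
    return tokens / tokens_per_second


def summarize(tokens: int) -> dict:
    """Compare standard and fast mode for a response of `tokens` tokens."""
    base = generation_seconds(tokens, BASE_TPS)
    fast = generation_seconds(tokens, FAST_TPS)
    speedup = FAST_TPS / BASE_TPS
    return {
        "speedup": speedup,
        "base_seconds": base,
        "fast_seconds": fast,
        "seconds_saved": base - fast,
        # How many times more you pay for each 1x of speedup gained.
        "cost_per_unit_speedup": COST_MULTIPLIER / speedup,
    }


if __name__ == "__main__":
    for key, value in summarize(6500).items():
        print(f"{key:>22}: {value:.2f}")
```

Under these assumptions the price rises faster than the speed, so fast mode pays off mainly where waiting time is the bottleneck.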

OpenAI takes a markedly different approach and reaches more than 1000 tokens per second, roughly 15x GPT-5.3-Codex's base rate of 65 tokens per second. It does so with a new model, GPT-5.3-Codex-Spark, which is purpose-built for speed and served on Cerebras chips. These chips are very large (70 square inches, compared with about one square inch for a typical H100) and provide ultra-low-latency compute by holding the entire model in their substantial on-chip memory.
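To see what these rates mean in wall-clock terms, the following sketch compares the three quoted throughput figures for a response of a given length. It is illustrative arithmetic only: the 1000 tokens-per-second figure is treated as a lower bound because the text says "more than 1000", and the labels and function name are not from any vendor API.

```python
# Throughput figures quoted in the text, in tokens per second.
RATES_TPS = {
    "GPT-5.3-Codex (base)": 65,
    "Opus 4.6 fast mode": 170,     # approximate
    "GPT-5.3-Codex-Spark": 1000,   # "more than 1000"; used as a lower bound
}

# Both vendors' base rates are quoted as 65 tokens per second.
BASELINE_TPS = 65


def comparison(tokens: int) -> list:
    """Return (name, seconds to generate `tokens`, speedup vs. baseline) rows."""
    return [
        (name, tokens / tps, tps / BASELINE_TPS)
        for name, tps in RATES_TPS.items()
    ]


if __name__ == "__main__":
    for name, seconds, ratio in comparison(2000):
        print(f"{name:24s} {seconds:6.1f} s  {ratio:4.1f}x")
```

These numbers cover only generation speed; they say nothing about output quality.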

Running entirely in memory minimizes data-streaming delays and gives OpenAI's setup its substantial speed advantage, but that speed comes at the cost of model capability. GPT-5.3-Codex-Spark is less capable than vanilla GPT-5.3-Codex, especially on more complex tasks and tool calls.

Who It's For

This comparison is most relevant to developers tuning AI system performance who need to weigh speed against capability.

📖 Read the full source: HN LLM Tools
