NerfGuard: A Classifier That Routes Coding Requests to Cheaper Models, Cutting Spend 3x

A team that switched from Claude Code to Codex for its speed and steerability found per-token pricing adding up fast. Their daily bill was strikingly high, and on inspection they were running top-tier models at maximum reasoning for every task, even trivial ones. So they built NerfGuard, a fast classifier that routes each request to the cheapest model and reasoning depth that can handle it.
At its core is a classifier that estimates the minimum intelligence a given coding request needs. On top of that, NerfGuard applies automated token-efficiency techniques. The result is roughly the same output quality at a fraction of the token spend. And because requests are bin-packed onto the right model and reasoning level, responses also arrive considerably faster. The team reports savings of up to 3x, plus hours per person per day no longer spent waiting on tool turns and agent responses.
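The source does not describe how NerfGuard's classifier actually works. To make the routing idea concrete, here is a minimal sketch under loudly stated assumptions. It uses a crude keyword-and-size heuristic as a stand-in for a real classifier, which in production would more plausibly be a learned model. The tier names (`small-model`, `mid-model`, `large-model`), the reasoning labels, the hint lists, and the `route`/`difficulty_score` functions are all hypothetical. The point is only the shape: score a request's difficulty, then pick the cheapest tier that clears it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    """A hypothetical routing target: model name plus reasoning effort."""
    model: str
    reasoning: str


# Placeholder tiers, ordered cheapest first. Not NerfGuard's real models.
TIERS = [
    Route("small-model", "low"),
    Route("mid-model", "medium"),
    Route("large-model", "high"),
]

# Hypothetical keyword signals; a real router would more plausibly learn these.
HARD_HINTS = ("architecture", "refactor", "race condition", "deadlock", "migrate")
TRIVIAL_HINTS = ("typo", "rename", "fix import", "add a comment")


def difficulty_score(prompt: str, files_touched: int) -> int:
    """Crude difficulty estimate: a higher score means a stronger model is needed."""
    text = prompt.lower()
    score = 0
    if any(hint in text for hint in HARD_HINTS):
        score += 2
    if files_touched > 3:  # multi-file changes need more context
        score += 1
    if len(text.split()) > 200:  # long, detailed specs
        score += 1
    if any(hint in text for hint in TRIVIAL_HINTS):
        score -= 1
    return score


def route(prompt: str, files_touched: int = 1) -> Route:
    """Clamp the score into a tier index, so low scores land on the cheapest tier."""
    score = difficulty_score(prompt, files_touched)
    index = max(0, min(score, len(TIERS) - 1))
    return TIERS[index]


if __name__ == "__main__":
    for prompt, files in [
        ("Fix the typo in the README", 1),
        ("Add input validation to the signup handler", 4),
        ("Refactor the payment module architecture", 5),
    ]:
        print(f"{prompt!r} -> {route(prompt, files)}")
```

With these rules, the typo fix lands on `small-model` at low reasoning, the four-file validation change on `mid-model`, and the architectural refactor on `large-model`. Clamping the score keeps every request on a valid tier, and ordering tiers cheapest-first means any uncertainty that lowers the score defaults toward the cheaper option.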
Key details from the source:
- Classifier routes to cheapest model + reasoning depth for each request
- Additional automatic token efficiency techniques
- Result: up to 3x usage for the same spend
- Speed improvements: hours per day per person saved
- More usage before hitting throttling limits
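A figure like "up to 3x usage for the same spend" comes down to blended-cost arithmetic: the expected cost per request is the traffic share sent to each tier multiplied by that tier's price, summed over tiers. The sketch below illustrates this with made-up relative prices and an assumed traffic mix. None of these numbers come from the NerfGuard team, and `blended_cost` is a hypothetical helper.

```python
def blended_cost(mix: dict[str, float], price: dict[str, float]) -> float:
    """Expected cost per request, given the share of traffic each tier receives."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(share * price[tier] for tier, share in mix.items())


# Illustrative relative cost per request (large model at max reasoning = 1.0).
# These prices and the traffic mix below are assumptions, not reported figures.
PRICE = {"large": 1.0, "mid": 0.3, "small": 0.05}

baseline = blended_cost({"large": 1.0}, PRICE)  # every request on the top tier
routed = blended_cost({"large": 0.2, "mid": 0.3, "small": 0.5}, PRICE)

print(f"baseline={baseline:.3f} routed={routed:.3f} usage multiplier={baseline / routed:.2f}x")
```

Under these assumptions, the routed cost is 0.2 + 0.09 + 0.025 = 0.315 per request, so the same budget buys about 3.2x as many requests. The multiplier is driven mostly by how much traffic can safely move off the top tier, which is why classifier accuracy matters more than the exact cheap-tier price.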
This is currently in use by engineers at multiple AI companies. The tool is available at nerfguard.com.
Who it's for: Teams using coding agents (Claude Code, Codex, etc.) who want to maximize output per dollar and reduce wait times.
📖 Read the full source: HN AI Agents
👀 See Also

Building a Sub-500ms Voice Agent: Architecture and Performance Insights
A developer built a voice agent from scratch achieving ~400ms end-to-end latency with full STT → LLM → TTS streaming. Key insights include treating voice as a turn-taking problem, using semantic end-of-turn detection, and colocating all components for minimal latency.

Open-source Claude skill for management consulting frameworks and case studies
A free, MIT-licensed Claude skill provides structured reference material for management consulting work, including frameworks, industry context, and case studies. The project consists of 80+ markdown files organized by domain and seeks contributors to expand coverage.

OpenClaw memory loss fix using Mem0 plugin
OpenClaw agents experience memory loss due to context compaction rewriting files like MEMORY.md. The Mem0 plugin solves this by moving memory outside the context window with auto-recall and auto-capture features.

LoreConvo: MCP Server Adds Persistent Session Memory to Claude Code
LoreConvo is an MCP server that provides Claude Code with persistent session memory, automatically saving and loading context between sessions. It saves 3,000-8,000 tokens per session by eliminating re-contexting overhead.