Building a Sub-500ms Voice Agent: Architecture and Performance Insights

Voice Agent Architecture and Performance
Nick Tikhonov built a voice agent from scratch that averages ~400ms end-to-end latency (phone stop → first syllable). That figure covers the full speech-to-text (STT) → LLM → text-to-speech (TTS) loop, with clean barge-ins and no precomputed responses. The implementation was about 2× faster than an equivalent Vapi setup.
Core Technical Insights
The key realization was that voice is a turn-taking problem, not a transcription problem. Voice Activity Detection (VAD) alone is not enough to tell when a turn has ended; semantic end-of-turn detection is required. The system reduces to one loop with two states: the user is speaking, or the user is listening.
The critical transitions are:
- Cancel instantly on barge-in
- Respond instantly on end-of-turn
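The loop and its two transitions can be sketched as a tiny state machine. This is a minimal illustration, not the author's implementation: the class name, the event names ("user_started_speaking", "end_of_turn"), and the recorded action strings are all hypothetical.

```python
from enum import Enum, auto


class TurnState(Enum):
    USER_SPEAKING = auto()   # the agent is listening
    USER_LISTENING = auto()  # the agent is speaking


class TurnLoop:
    """Two-state turn tracker. Event and action names are hypothetical."""

    def __init__(self) -> None:
        self.state = TurnState.USER_SPEAKING
        self.actions = []  # side effects are only recorded, for illustration

    def on_event(self, event: str) -> TurnState:
        if self.state is TurnState.USER_LISTENING and event == "user_started_speaking":
            # Barge-in: the agent must stop talking right away.
            self.actions.append("cancel_response")
            self.state = TurnState.USER_SPEAKING
        elif self.state is TurnState.USER_SPEAKING and event == "end_of_turn":
            # The user is done: start responding with minimal delay.
            self.actions.append("start_response")
            self.state = TurnState.USER_LISTENING
        # Any other event leaves the state unchanged.
        return self.state
```

In a real agent, the recorded actions would be replaced by calls that cancel or start the streaming pipeline.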
Technical Requirements
STT → LLM → TTS must stream end to end, with each stage passing partial output to the next as soon as it is produced. A sequential pipeline that waits for each stage to finish is too slow for natural conversation. Time To First Token (TTFT) dominates latency in voice interfaces: the first token is on the critical path. Groq's ~80ms TTFT was identified as the single biggest performance win.
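One way to picture the streaming requirement is with chained async generators, where the TTS stage starts consuming tokens before the LLM has finished. The sketch below uses stand-in functions (fake_llm, fake_tts, run_turn), which are hypothetical and do not correspond to any real provider API.

```python
import asyncio
from typing import AsyncIterator


async def fake_llm(prompt: str) -> AsyncIterator[str]:
    """Stand-in for a streaming LLM client: yields tokens one at a time."""
    for token in ["Sure,", " I", " can", " help."]:
        await asyncio.sleep(0)  # placeholder for waiting on the network
        yield token


async def fake_tts(tokens: AsyncIterator[str], log: list) -> AsyncIterator[bytes]:
    """Stand-in for streaming TTS: emits one audio chunk per incoming token."""
    async for token in tokens:
        log.append(f"tts got {token!r}")
        yield token.encode()  # pretend these bytes are synthesized audio


async def run_turn(transcript: str) -> list:
    """Pipe LLM tokens straight into TTS and 'play' chunks as they arrive."""
    log = []
    async for chunk in fake_tts(fake_llm(transcript), log):
        log.append(f"play {chunk!r}")
    return log


if __name__ == "__main__":
    for line in asyncio.run(run_turn("hello")):
        print(line)
```

The log interleaves "tts got" and "play" entries, which shows the first audio chunk going out while later tokens are still being generated.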
Infrastructure Considerations
Geography matters more than prompts. All components must be colocated; otherwise network round trips between services consume the latency budget before the system even starts processing. The build took approximately one day and roughly $100 in API credits.
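The colocation point can be made concrete with simple arithmetic: every hop between services adds a network round trip on top of each stage's own time to first output. The helper below is a hypothetical sketch. Only the 80 ms LLM figure echoes the TTFT quoted above; the STT, TTS, hop count, and round-trip values are assumptions made up for illustration, not measurements from the source.

```python
def time_to_first_audio_ms(stage_ms: dict, hops: int, rtt_ms: float) -> float:
    """Sum each stage's time to first output plus one round trip per hop."""
    return sum(stage_ms.values()) + hops * rtt_ms


# Illustrative inputs only (assumed, not measured).
stages = {"stt": 100, "llm_ttft": 80, "tts_first_chunk": 100}

colocated = time_to_first_audio_ms(stages, hops=3, rtt_ms=5)
cross_region = time_to_first_audio_ms(stages, hops=3, rtt_ms=80)

if __name__ == "__main__":
    print(f"colocated: {colocated} ms, cross-region: {cross_region} ms")
```

With these made-up inputs the colocated case comes to 295 ms and the cross-region case to 520 ms, so the network alone decides whether the total stays under 500 ms.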
Why Voice Agents Are Challenging
Voice agents are significantly more complex than text agents. The orchestration is continuous and real-time, requiring careful management of multiple models simultaneously. The system must constantly decide whether the user is speaking or listening, and the transitions between those two states are the hardest part.
When the user starts speaking, the agent must immediately stop talking - cancel generation, cancel speech synthesis, and flush any buffered audio. When the user stops speaking, the system must confidently decide they're done and start responding with minimal delay.
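The barge-in path described above (cancel generation, cancel synthesis, flush buffered audio) might look like the following with asyncio. This is a sketch under assumptions: speak, flush, and barge_in_demo are hypothetical names, a single task stands in for both generation and synthesis, and an asyncio.Queue stands in for the outgoing audio buffer.

```python
import asyncio


async def speak(audio_out: asyncio.Queue) -> None:
    """Stand-in for generation plus synthesis: keeps queuing audio chunks."""
    n = 0
    while True:
        await audio_out.put(f"chunk-{n}".encode())
        n += 1
        await asyncio.sleep(0.01)


def flush(queue: asyncio.Queue) -> int:
    """Drop all buffered audio and return how many chunks were discarded."""
    dropped = 0
    while not queue.empty():
        queue.get_nowait()
        dropped += 1
    return dropped


async def barge_in_demo() -> tuple:
    audio_out = asyncio.Queue()
    task = asyncio.create_task(speak(audio_out))
    await asyncio.sleep(0.05)  # the agent talks for a moment
    task.cancel()              # the user started speaking: stop producing audio
    try:
        await task             # wait until the cancellation has taken effect
    except asyncio.CancelledError:
        pass
    dropped = flush(audio_out)  # discard audio that was queued but not played
    return task.cancelled(), dropped, audio_out.qsize()


if __name__ == "__main__":
    print(asyncio.run(barge_in_demo()))
```

Flushing matters as much as cancelling: without it, audio already queued would keep playing after the user has interrupted.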
Architecture Approach
The developer started by iterating on architecture with ChatGPT outside the editor to build a mental model first. The entire problem was reduced to a single loop and a tiny state machine. The core question a voice agent needs to answer is: is the user speaking, or listening?
The two states are:
- The user is speaking
- The user is listening
This turn-detection logic forms the core of every voice system. The implementation is available on GitHub for reference and further development.
📖 Read the full source: HN AI Agents