Building a Sub-500ms Voice Agent: Architecture and Performance Insights

OpenClawRadar · Published: March 3, 2026

Voice Agent Architecture and Performance

Nick Tikhonov built a voice agent from scratch that averages ~400ms end-to-end latency (phone stop → first syllable). That figure covers the full speech-to-text (STT) → LLM → text-to-speech (TTS) loop, with clean barge-ins and no precomputed responses. The implementation achieved half the latency of Vapi's equivalent setup.

Core Technical Insights

The key realization was that voice is a turn-taking problem, not a transcription problem. Voice Activity Detection (VAD) alone fails; semantic end-of-turn detection is required. The system reduces to one loop with two states: speaking vs listening.

The critical transitions are:

  • Cancel instantly on barge-in
  • Respond instantly on end-of-turn
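The source does not describe how the semantic end-of-turn check is implemented. As a rough illustration only, the sketch below combines a VAD-style silence timer with a toy completeness heuristic; the thresholds, the `TRAILING_FILLERS` word list, and both function names are assumptions made for this example, and a real system would likely replace `looks_complete` with a trained model.

```python
# Hypothetical thresholds, chosen for illustration only.
SHORT_PAUSE_MS = 200    # pause after which a complete-sounding utterance ends the turn
LONG_PAUSE_MS = 1200    # pause after which the turn ends regardless of content

# Toy stand-in for a semantic end-of-turn model: trailing words that
# suggest the speaker is still mid-thought.
TRAILING_FILLERS = {"and", "but", "so", "um", "uh", "because", "or", "the", "to"}


def looks_complete(transcript: str) -> bool:
    """Return True if the partial transcript does not end on a filler word."""
    words = transcript.lower().rstrip(".?! ").split()
    if not words:
        return False
    return words[-1] not in TRAILING_FILLERS


def end_of_turn(transcript: str, silence_ms: int) -> bool:
    """Decide whether the user has finished their turn."""
    if silence_ms >= LONG_PAUSE_MS:
        return True  # silence alone is enough once it is long
    # A short pause only counts if the utterance also sounds finished.
    return silence_ms >= SHORT_PAUSE_MS and looks_complete(transcript)
```

With silence alone, a 250ms pause after "I'd like to book a table and" would either trigger a premature reply or force a long timeout on every turn; adding the content check lets the short threshold apply only when the utterance sounds finished.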

Technical Requirements

Every stage of STT → LLM → TTS must stream its output to the next. A sequential pipeline, where each stage waits for the previous one to finish, is too slow for natural conversation. Time to first token (TTFT) dominates latency in voice interfaces because the first token sits on the critical path. Groq's ~80ms TTFT was identified as the single biggest performance win.
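To make the streaming requirement concrete, here is a minimal sketch using Python generators. The functions `fake_llm`, `chunk_for_tts`, and `fake_tts` are hypothetical stand-ins, not the author's code or any provider's API; the point is only that the first audio chunk is available after the first phrase, not after the full reply.

```python
from typing import Iterable, Iterator


def fake_llm(reply: str, consumed: list[str]) -> Iterator[str]:
    """Hypothetical LLM stream: yields one word at a time and logs each yield."""
    for word in reply.split():
        consumed.append(word)
        yield word


def chunk_for_tts(tokens: Iterable[str]) -> Iterator[str]:
    """Group tokens into phrases ending at punctuation, so synthesis can start
    on the first phrase while later tokens are still being generated."""
    phrase: list[str] = []
    for token in tokens:
        phrase.append(token)
        if token[-1] in ".,!?":
            yield " ".join(phrase)
            phrase = []
    if phrase:
        yield " ".join(phrase)


def fake_tts(phrases: Iterable[str]) -> Iterator[bytes]:
    """Hypothetical TTS stream: one placeholder audio chunk per phrase."""
    for phrase in phrases:
        yield phrase.encode("utf-8")


consumed: list[str] = []
audio = fake_tts(chunk_for_tts(fake_llm("Sure, I can help. What time works?", consumed)))
first_chunk = next(audio)  # only the first LLM token has been pulled at this point
```

Because each stage pulls lazily from the one before it, time to first audio depends on how quickly the first tokens arrive rather than on the length of the whole reply, which is why TTFT matters so much here.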

Infrastructure Considerations

Geography matters more than prompts. All components must be colocated; otherwise network round trips between them make latency prohibitive before the system even starts processing. The build took approximately one day and roughly $100 in API credits.

Why Voice Agents Are Challenging

Voice agents represent a significant complexity increase compared to text agents. The orchestration is continuous and real-time, requiring careful management of multiple models simultaneously. The system must constantly decide whether the user is speaking or listening, with transitions between these states being the most difficult aspect.

When the user starts speaking, the agent must immediately stop talking - cancel generation, cancel speech synthesis, and flush any buffered audio. When the user stops speaking, the system must confidently decide they're done and start responding with minimal delay.
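The barge-in sequence described above (cancel generation, cancel synthesis, flush buffered audio) can be sketched with `asyncio` task cancellation. This is an illustration under assumptions: `speak`, `flush`, and `demo` are hypothetical names, the sleeps stand in for real model latency, and generation and synthesis share a single task here for brevity.

```python
import asyncio


async def speak(reply_words: list[str], playback: asyncio.Queue) -> None:
    """Hypothetical agent turn: produces audio word by word and queues it."""
    for word in reply_words:
        await asyncio.sleep(0.01)       # stands in for LLM + TTS latency
        playback.put_nowait(word)


def flush(queue: asyncio.Queue) -> int:
    """Drop audio that was queued but not yet played; return the count dropped."""
    dropped = 0
    while not queue.empty():
        queue.get_nowait()
        dropped += 1
    return dropped


async def demo() -> tuple[bool, int]:
    playback: asyncio.Queue = asyncio.Queue()
    words = ["one", "two", "three", "four", "five"] * 20
    task = asyncio.create_task(speak(words, playback))
    await asyncio.sleep(0.05)           # the user starts talking mid-reply
    task.cancel()                       # stop generation and synthesis
    try:
        await task                      # wait until cancellation has taken effect
    except asyncio.CancelledError:
        pass
    flush(playback)                     # discard audio that was already buffered
    return task.cancelled(), playback.qsize()


cancelled, remaining = asyncio.run(demo())
```

Flushing matters as much as cancelling: without it, audio already sitting in the playback queue would keep playing over the user after the task has stopped.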

Architecture Approach

The developer started by iterating on architecture with ChatGPT outside the editor to build a mental model first. The entire problem was reduced to a single loop and a tiny state machine. The core question a voice agent needs to answer is: is the user speaking, or listening?

The two states are:

  • The user is speaking
  • The user is listening
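The two states and the transitions between them can be sketched as a tiny state machine. The event names `speech_started` and `end_of_turn`, the action strings, and the `VoiceLoop` class are assumptions for illustration; the source only states that the system reduces to one loop with two states.

```python
from enum import Enum, auto


class Turn(Enum):
    USER_SPEAKING = auto()    # agent stays silent and listens
    USER_LISTENING = auto()   # agent is free to speak


class VoiceLoop:
    """Minimal two-state turn tracker that records which action each event triggers."""

    def __init__(self) -> None:
        self.state = Turn.USER_SPEAKING
        self.actions: list[str] = []

    def on_event(self, event: str) -> None:
        if self.state is Turn.USER_LISTENING and event == "speech_started":
            self.actions.append("cancel_reply")   # barge-in
            self.state = Turn.USER_SPEAKING
        elif self.state is Turn.USER_SPEAKING and event == "end_of_turn":
            self.actions.append("start_reply")    # user finished their turn
            self.state = Turn.USER_LISTENING
        # any other event leaves the state unchanged


loop = VoiceLoop()
for event in ["end_of_turn", "speech_started", "speech_started", "end_of_turn"]:
    loop.on_event(event)
```

Only two transitions do any work, which matches the article's framing: everything else in the system exists to make those two transitions fast and reliable.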

This turn-detection logic forms the core of every voice system. The implementation is available on GitHub for reference and further development.

📖 Read the full source: HN AI Agents
