Voxray-AI: Production Go Backend for Real-Time Voice Pipelines

Production Voice Agent Pipeline in Go

Voxray-AI provides a complete streaming pipeline in Go: it accepts client audio over WebSocket or WebRTC, runs it through STT → LLM → TTS, and streams the synthesized audio back to the client. It is designed for production servers and high-concurrency voice workloads.
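To make the STT → LLM → TTS flow concrete, here is a minimal sketch of a staged streaming pipeline built from goroutines and channels. It is not taken from the Voxray-AI source: the names Stage, NewStage, and RunPipeline are hypothetical, and plain strings stand in for audio frames, transcripts, and tokens.

```go
package main

import "fmt"

// Stage consumes one stream and produces another. Real stages would carry
// audio frames, transcripts, and tokens; strings keep the sketch small.
// (Hypothetical type, not the project's API.)
type Stage func(in <-chan string) <-chan string

// NewStage wraps a per-item function in a goroutine that forwards results
// downstream and closes its output once the input is exhausted.
func NewStage(fn func(string) string) Stage {
	return func(in <-chan string) <-chan string {
		out := make(chan string)
		go func() {
			defer close(out)
			for v := range in {
				out <- fn(v)
			}
		}()
		return out
	}
}

// RunPipeline feeds inputs through the stages in order and collects the
// final output. Each stage handles items sequentially, so order is preserved.
func RunPipeline(inputs []string, stages ...Stage) []string {
	src := make(chan string)
	go func() {
		defer close(src)
		for _, v := range inputs {
			src <- v
		}
	}()
	var ch <-chan string = src
	for _, s := range stages {
		ch = s(ch)
	}
	var out []string
	for v := range ch {
		out = append(out, v)
	}
	return out
}

func main() {
	// Placeholder stages; a real system would call STT, LLM, and TTS providers.
	stt := NewStage(func(s string) string { return "stt(" + s + ")" })
	llm := NewStage(func(s string) string { return "llm(" + s + ")" })
	tts := NewStage(func(s string) string { return "tts(" + s + ")" })
	for _, out := range RunPipeline([]string{"frame1", "frame2"}, stt, llm, tts) {
		fmt.Println(out)
	}
}
```

Because each stage runs in its own goroutine, a later stage can start working on the first item while an earlier stage is still handling the next one, which is the property a low-latency voice pipeline depends on.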

Transport Options

The system supports multiple transport mechanisms:

WebSocket at /ws with RTVI serializer (?rtvi=1) and Protobuf (?format=protobuf) support
WebRTC at /webrtc/offer with full SDP offer/answer, configurable STUN/TURN, and Opus encoding (requires CGO build)
Telephony runner transports: Twilio, Telnyx, Plivo, Exotel, LiveKit, Daily.co
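
As a small client-side illustration, the sketch below builds the WebSocket URL for the /ws endpoint with the query parameters listed above. The Serializer type and the BuildWSURL helper are hypothetical names introduced for this example. The post does not say whether the two parameters can be combined, so the sketch selects at most one.

```go
package main

import (
	"fmt"
	"net/url"
)

// Serializer picks the wire format requested on /ws.
// (Hypothetical type for this example.)
type Serializer string

const (
	SerializerDefault  Serializer = ""         // no query parameter
	SerializerRTVI     Serializer = "rtvi"     // adds ?rtvi=1
	SerializerProtobuf Serializer = "protobuf" // adds ?format=protobuf
)

// BuildWSURL returns the WebSocket URL for the given host. The secure flag
// switches between the ws and wss schemes.
func BuildWSURL(host string, secure bool, s Serializer) string {
	scheme := "ws"
	if secure {
		scheme = "wss"
	}
	q := url.Values{}
	switch s {
	case SerializerRTVI:
		q.Set("rtvi", "1")
	case SerializerProtobuf:
		q.Set("format", "protobuf")
	}
	u := url.URL{Scheme: scheme, Host: host, Path: "/ws", RawQuery: q.Encode()}
	return u.String()
}

func main() {
	fmt.Println(BuildWSURL("localhost:8080", false, SerializerRTVI))
	fmt.Println(BuildWSURL("localhost:8080", false, SerializerProtobuf))
}
```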

Pluggable Providers

All components are swappable via configuration:

STT providers: OpenAI, Groq, Sarvam, Google, AWS
LLM providers: OpenAI, Anthropic, Groq, others
TTS providers: OpenAI, Google, AWS Polly, Sarvam
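
One common way to make providers swappable by configuration is an interface plus a registry keyed by provider name. The sketch below shows that pattern for TTS only. It is an assumption about how such a design could look, not the project's actual code: the TTS interface, RegisterTTS, NewTTS, and fakeTTS are all hypothetical.

```go
package main

import "fmt"

// TTS is a hypothetical interface that every text-to-speech backend
// would implement.
type TTS interface {
	Synthesize(text string) ([]byte, error)
}

// TTSFactory builds a backend for the voice named in the configuration.
type TTSFactory func(voice string) TTS

var ttsRegistry = map[string]TTSFactory{}

// RegisterTTS makes a backend selectable by its provider name.
func RegisterTTS(name string, f TTSFactory) { ttsRegistry[name] = f }

// NewTTS looks up the provider named in the configuration.
func NewTTS(provider, voice string) (TTS, error) {
	f, ok := ttsRegistry[provider]
	if !ok {
		return nil, fmt.Errorf("unknown tts provider %q", provider)
	}
	return f(voice), nil
}

// fakeTTS stands in for a real backend so the sketch runs offline.
type fakeTTS struct{ voice string }

func (f fakeTTS) Synthesize(text string) ([]byte, error) {
	return []byte(f.voice + ":" + text), nil
}

func main() {
	RegisterTTS("fake", func(voice string) TTS { return fakeTTS{voice: voice} })

	tts, err := NewTTS("fake", "en-US-Neural2-F")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	audio, _ := tts.Synthesize("hello")
	fmt.Printf("%d bytes of placeholder audio\n", len(audio))
}
```

With this shape, switching providers only changes the string passed to NewTTS, which is what a "provider" key in a config file would supply.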

Configuration Examples

Minimal configuration example:

{
  "transport": "both",
  "stt": { "provider": "groq", "model": "whisper-large-v3" },
  "llm": { "provider": "anthropic", "model": "claude-3-5-haiku" },
  "tts": { "provider": "google", "voice": "en-US-Neural2-F" }
}
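
For readers who want to consume a file like this from Go, the following sketch decodes the minimal configuration with encoding/json and rejects a config that omits a provider. The struct and function names (Config, ProviderConfig, ParseConfig) are hypothetical, and the validation rule is an assumption; only the JSON keys come from the example above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ProviderConfig covers the keys used by the stt, llm, and tts blocks.
type ProviderConfig struct {
	Provider string `json:"provider"`
	Model    string `json:"model,omitempty"`
	Voice    string `json:"voice,omitempty"`
}

// Config mirrors the minimal configuration example.
type Config struct {
	Transport string         `json:"transport"`
	STT       ProviderConfig `json:"stt"`
	LLM       ProviderConfig `json:"llm"`
	TTS       ProviderConfig `json:"tts"`
}

const sampleConfig = `{
  "transport": "both",
  "stt": { "provider": "groq", "model": "whisper-large-v3" },
  "llm": { "provider": "anthropic", "model": "claude-3-5-haiku" },
  "tts": { "provider": "google", "voice": "en-US-Neural2-F" }
}`

// ParseConfig decodes the JSON and checks that each stage names a provider.
// (The "provider is required" rule is an assumption made for this sketch.)
func ParseConfig(data []byte) (Config, error) {
	var c Config
	if err := json.Unmarshal(data, &c); err != nil {
		return Config{}, fmt.Errorf("parse config: %w", err)
	}
	stages := []struct {
		name string
		cfg  ProviderConfig
	}{{"stt", c.STT}, {"llm", c.LLM}, {"tts", c.TTS}}
	for _, s := range stages {
		if s.cfg.Provider == "" {
			return Config{}, fmt.Errorf("config: %s.provider is required", s.name)
		}
	}
	return c, nil
}

func main() {
	c, err := ParseConfig([]byte(sampleConfig))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(c.STT.Provider, c.LLM.Provider, c.TTS.Provider)
}
```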

Turn-taking and voice activity detection configuration:

{
  "turn_detection": "silence",
  "vad_type": "silero",
  "vad_confidence": 0.7,
  "vad_start_secs_vad": 0.2,
  "vad_stop_secs": 0.8,
  "turn_max_duration_secs": 30,
  "user_idle_timeout_secs": 60
}
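
The post does not spell out what each of these settings does. One plausible reading, sketched below, is that a turn starts once the VAD confidence stays at or above the threshold for the start window, and ends once it stays below the threshold for the stop window. This is an assumption about the semantics, not the project's implementation. TurnDetector, Feed, and the event names are hypothetical, and turn_max_duration_secs and user_idle_timeout_secs are not modeled.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// Event reports a turn boundary detected by Feed.
type Event int

const (
	None Event = iota
	TurnStarted
	TurnEnded
)

// TurnDetector is a hypothetical silence-based turn detector.
type TurnDetector struct {
	confidence float64       // minimum VAD score counted as speech
	startAfter time.Duration // sustained speech needed to start a turn
	stopAfter  time.Duration // sustained silence needed to end a turn
	speaking   bool
	pending    time.Duration // how long frames have disagreed with the current state
}

// secs converts fractional seconds from the config into a Duration.
func secs(s float64) time.Duration {
	return time.Duration(math.Round(s * float64(time.Second)))
}

// NewTurnDetector takes values in the units used by the config example.
func NewTurnDetector(confidence, startSecs, stopSecs float64) *TurnDetector {
	return &TurnDetector{confidence: confidence, startAfter: secs(startSecs), stopAfter: secs(stopSecs)}
}

// Feed processes one VAD frame of the given length and reports a boundary
// once the opposite state has persisted for the configured window.
func (d *TurnDetector) Feed(score float64, frame time.Duration) Event {
	voiced := score >= d.confidence
	if voiced == d.speaking {
		d.pending = 0 // frame agrees with the current state
		return None
	}
	d.pending += frame
	window := d.startAfter
	if d.speaking {
		window = d.stopAfter
	}
	if d.pending < window {
		return None
	}
	d.speaking = !d.speaking
	d.pending = 0
	if d.speaking {
		return TurnStarted
	}
	return TurnEnded
}

func main() {
	d := NewTurnDetector(0.7, 0.2, 0.8)
	scores := []float64{0.9, 0.9, 0.9, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1}
	for i, s := range scores {
		switch d.Feed(s, 100*time.Millisecond) {
		case TurnStarted:
			fmt.Println("turn started at frame", i)
		case TurnEnded:
			fmt.Println("turn ended at frame", i)
		}
	}
}
```

A longer stop window tolerates natural pauses at the cost of slower responses, which is the trade-off a value such as 0.8 seconds is balancing.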

Observability & Storage

/metrics endpoint for Prometheus (request counts, latency histograms, active connection gauges)
Recording: Full session audio to S3 with configurable worker pool and format
Transcripts: Per-message storage to Postgres or MySQL with configurable table
/health and /ready endpoints with optional Redis session store check on /ready
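
The split between /health and /ready can be illustrated with a short net/http sketch: /health only reports that the process is up, while /ready also runs an optional dependency check, such as a session store ping. The Checker type, NewProbeMux, and the 503 status on failure are assumptions made for this example, not documented behavior of the project.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// Checker reports whether a dependency is reachable. (Hypothetical type;
// in practice it might wrap a Redis ping.)
type Checker func() error

var errStoreDown = errors.New("session store unreachable")

// NewProbeMux registers /health and /ready. A nil sessionStore skips the
// optional check, so /ready then behaves like /health.
func NewProbeMux(sessionStore Checker) *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	mux.HandleFunc("/ready", func(w http.ResponseWriter, r *http.Request) {
		if sessionStore != nil {
			if err := sessionStore(); err != nil {
				// 503 is an assumed status code for "not ready".
				http.Error(w, "session store unavailable", http.StatusServiceUnavailable)
				return
			}
		}
		w.WriteHeader(http.StatusOK)
	})
	return mux
}

// probe issues an in-memory GET against the handler and returns the status.
func probe(h http.Handler, path string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	failing := NewProbeMux(func() error { return errStoreDown })
	fmt.Println("/health:", probe(failing, "/health"))
	fmt.Println("/ready: ", probe(failing, "/ready"))
}
```

Keeping the dependency check out of /health means an orchestrator can stop routing traffic to an instance with a broken session store without restarting it.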

Security Features

server_api_key gates /ws, /webrtc/offer, /start, /sessions/* via Authorization: Bearer or X-API-Key
CORS allowlist configuration
TLS cert/key configuration
12-factor style: JSON config + environment variable overrides
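
The API key gate and the environment override can be sketched together as a small piece of net/http middleware. The code below accepts the key from either Authorization: Bearer or X-API-Key, as described above, and lets an environment variable replace the value from the JSON file. Several details are assumptions: the variable name VOXRAY_SERVER_API_KEY is hypothetical, an empty key is treated as "gate disabled", and rejected requests get a 401.

```go
package main

import (
	"crypto/subtle"
	"fmt"
	"net/http"
	"net/http/httptest"
	"os"
	"strings"
)

// resolveServerKey lets an environment variable override the file value.
// (VOXRAY_SERVER_API_KEY is a hypothetical name.)
func resolveServerKey(fileKey string, getenv func(string) string) string {
	if v := getenv("VOXRAY_SERVER_API_KEY"); v != "" {
		return v
	}
	return fileKey
}

// extractKey reads the key from Authorization: Bearer, then X-API-Key.
func extractKey(r *http.Request) string {
	if auth := r.Header.Get("Authorization"); strings.HasPrefix(auth, "Bearer ") {
		return strings.TrimPrefix(auth, "Bearer ")
	}
	return r.Header.Get("X-API-Key")
}

// RequireAPIKey rejects requests that do not present serverKey.
// An empty serverKey disables the gate (an assumption for this sketch).
func RequireAPIKey(serverKey string, next http.Handler) http.Handler {
	if serverKey == "" {
		return next
	}
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		got := extractKey(r)
		// Constant-time comparison avoids leaking the key through timing.
		if subtle.ConstantTimeCompare([]byte(got), []byte(serverKey)) != 1 {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// protected stands in for a gated route such as /ws.
var protected = http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
	w.WriteHeader(http.StatusOK)
})

// statusFor sends an in-memory request with one optional header set.
func statusFor(h http.Handler, header, value string) int {
	req := httptest.NewRequest(http.MethodGet, "/ws", nil)
	if header != "" {
		req.Header.Set(header, value)
	}
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	key := resolveServerKey("key-from-json", os.Getenv)
	h := RequireAPIKey(key, protected)
	fmt.Println("bearer: ", statusFor(h, "Authorization", "Bearer "+key))
	fmt.Println("missing:", statusFor(h, "", ""))
}
```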

This kind of backend suits developers building real-time voice applications who need to combine several AI services behind production-ready infrastructure.

📖 Read the full source: r/LocalLLaMA