Exploring Mistral Voxtral Realtime 4B in Pure C for Speech-to-Text

Mistral's Voxtral Realtime 4B is a speech-to-text model, and voxtral.c by antirez is a pure C implementation of its inference pipeline. The code needs nothing beyond the C standard library: no Python runtime, CUDA toolkit, or other external library is required at inference time.
Key Features
- Pure C Implementation: Nothing beyond the C standard library is required, which suits environments where a minimal dependency footprint matters.
- Platform-Specific Backends: Two make targets are offered: make mps for Apple Silicon, which is the faster option, and make blas for Intel Macs or Linux systems with OpenBLAS, which is slower because of the bf16-to-fp32 conversion it requires.
- Audio Processing: A chunked encoder with overlapping windows bounds memory usage regardless of input length. Audio can also be fed through stdin or, on macOS, captured from the microphone, so both live and file-based transcription are supported.
- Streaming C API: The API, vox_stream_t, accepts audio incrementally and emits token strings as they are generated.
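The bf16-to-fp32 conversion mentioned for the BLAS backend is cheap per element but has to touch every value, which is one plausible reason it costs time. The sketch below shows the general technique only; the function names bf16_to_f32 and bf16_row_to_f32 are hypothetical and are not taken from voxtral.c. It assumes values are stored as raw 16-bit bf16 words.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* bf16 is the upper 16 bits of an IEEE-754 binary32 value, so widening
 * it only requires shifting the bits into the high half of a 32-bit word.
 * (Hypothetical helper, not part of voxtral.c.) */
float bf16_to_f32(uint16_t h)
{
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f);  /* copy the bit pattern without aliasing issues */
    return f;
}

/* Widen n bf16 values into a caller-provided fp32 buffer, e.g. one
 * row of a matrix before handing it to an fp32 BLAS routine. */
void bf16_row_to_f32(const uint16_t *src, float *dst, size_t n)
{
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < n; i++)
        dst[i] = bf16_to_f32(src[i]);
}
```

Converting a row at a time into a small scratch buffer, rather than widening the whole model up front, would keep the fp32 copy from doubling memory use; whether voxtral.c does this is not stated in the source.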
Usage
- Download the model (~8.9GB) using ./download_model.sh.
- For audio transcription from a file: ./voxtral -d voxtral-model -i audio.wav
- Live transcription from a mic on macOS: ./voxtral -d voxtral-model --from-mic
- Transcoding and transcription with ffmpeg: ffmpeg -i audio.mp3 -f s16le -ar 16000 -ac 1 - 2> /dev/null | ./voxtral -d voxtral-model --stdin
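The ffmpeg flags in the last command produce raw signed 16-bit little-endian PCM, mono, at 16 kHz. As a rough illustration of what any consumer of such a stream has to do, the sketch below decodes those bytes into floating-point samples. It is not voxtral.c's actual reader: s16le_to_f32, read_pcm_chunk, and CHUNK_SAMPLES are hypothetical names, and the 100 ms chunk size is an arbitrary choice.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define CHUNK_SAMPLES 1600  /* 100 ms at 16 kHz; an arbitrary choice */

/* Decode n little-endian signed 16-bit samples into floats in [-1, 1). */
void s16le_to_f32(const unsigned char *bytes, float *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        /* Assemble the 16-bit word from low byte, then high byte. */
        uint16_t u = (uint16_t)(bytes[2 * i] | (bytes[2 * i + 1] << 8));
        /* Sign-extend by hand to stay portable. */
        int32_t v = u;
        if (v >= 32768)
            v -= 65536;
        out[i] = (float)v / 32768.0f;
    }
}

/* Read up to max_samples (capped at CHUNK_SAMPLES) from a stream such as
 * stdin. Returns the number of complete samples decoded; 0 means EOF or error. */
size_t read_pcm_chunk(FILE *in, float *out, size_t max_samples)
{
    unsigned char buf[CHUNK_SAMPLES * 2];
    if (max_samples > CHUNK_SAMPLES)
        max_samples = CHUNK_SAMPLES;
    size_t got = fread(buf, 2, max_samples, in);  /* counts whole 2-byte items */
    s16le_to_f32(buf, out, got);
    return got;
}
```

A caller would loop on read_pcm_chunk(stdin, samples, CHUNK_SAMPLES) until it returns 0 and pass each decoded chunk on to the transcriber.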
The project needs further testing, since it has so far been checked against only a limited set of samples. Production readiness may require more work, particularly on long transcriptions, which are what exercise the KV cache's circular buffer.
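To see why long inputs are the interesting case, consider a generic circular buffer: nothing unusual happens until more entries have been written than there are slots, at which point indices wrap and the oldest entries are overwritten. The sketch below shows only that index arithmetic. The ring_t type and its functions are hypothetical and do not reflect voxtral.c's actual KV cache layout.

```c
#include <assert.h>
#include <stddef.h>

/* Bookkeeping for a fixed-capacity ring (hypothetical, for illustration). */
typedef struct {
    size_t cap;    /* number of slots, i.e. the window length */
    size_t total;  /* entries written since the start */
} ring_t;

/* Reserve the slot for the next entry and return its index.
 * Once total exceeds cap, this overwrites the oldest entry. */
size_t ring_push(ring_t *r)
{
    assert(r->cap > 0);
    return r->total++ % r->cap;
}

/* Number of entries currently held. */
size_t ring_len(const ring_t *r)
{
    return r->total < r->cap ? r->total : r->cap;
}

/* Slot index of the i-th oldest live entry, for i in [0, ring_len). */
size_t ring_slot(const ring_t *r, size_t i)
{
    size_t first = r->total - ring_len(r);  /* logical position of the oldest entry */
    assert(i < ring_len(r));
    return (first + i) % r->cap;
}
```

A short test clip never pushes total past cap, so the wraparound path in ring_push and ring_slot would go unexercised, which matches the concern about limited samples.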
📖 Read the full source: HN AI Agents