RunAnywhere RCLI: On-Device Voice AI Pipeline for Apple Silicon

What RCLI Does
RCLI is a complete voice AI pipeline that runs speech-to-text, large language model inference, and text-to-speech entirely on-device on Apple Silicon Macs. It requires macOS 13+ on M1 or later chips and operates without cloud services or API keys.
Installation and Setup
Install via Homebrew:
brew tap RunanywhereAI/rcli https://github.com/RunanywhereAI/RCLI.git
brew install rcli
rcli setup # downloads ~1 GB of models
Or using curl:
curl -fsSL https://raw.githubusercontent.com/RunanywhereAI/RCLI/main/install.sh | bash
Performance Claims
The developers benchmarked on an M4 Max with 64GB RAM and report:
- LLM decode: 1.67x faster than llama.cpp, 1.19x faster than Apple MLX
- Qwen3-0.6B: 658 tokens/sec (vs mlx-lm 552, llama.cpp 295)
- Qwen3-4B: 186 tokens/sec (vs mlx-lm 170, llama.cpp 87)
- Time-to-first-token: 6.6 ms
- STT: 70 seconds of audio transcribed in 101 ms (714x real-time, 4.6x faster than mlx-whisper)
- TTS: 178 ms synthesis (2.8x faster than mlx-audio and sherpa-onnx)
Key Features
- Three concurrent threads with lock-free ring buffers
- Double-buffered TTS (next sentence renders while current plays)
- 38 macOS actions controllable by voice
- Local RAG with ~4 ms retrieval over 5K+ document chunks
- 20 hot-swappable models
- Full-screen TUI with per-operation latency readouts
- Falls back to llama.cpp when MetalRT isn't installed
Voice Pipeline Components
- VAD: Silero voice activity detection
- STT: Zipformer streaming + Whisper/Parakeet offline
- LLM: Qwen3/LFM2/Qwen3.5 with KV cache continuation and Flash Attention
- TTS: Double-buffered sentence-level synthesis
- Tool Calling: LLM-native tool call formats
- Multi-turn Memory: Sliding window conversation history with token-budget trimming
Usage Commands
rcli # interactive TUI with push-to-talk
rcli listen # continuous voice mode
rcli ask "open Safari" # one-shot command
rcli rag ingest ~/Documents/notes # index documents for RAG
rcli ask --rag ~/Library/RCLI/index "summarize the project plan"
TUI Controls
- SPACE: Push-to-talk
- M: Models browser for downloading and hot-swapping LLM/STT/TTS
- A: Actions browser to enable/disable macOS actions
- B: Run STT, LLM, TTS, and end-to-end benchmarks
- R: RAG document ingestion
- X: Clear conversation and reset context
- T: Toggle tool call trace
- ESC: Stop/close/quit
MetalRT Engine Details
MetalRT is RunAnywhere's proprietary GPU inference engine that uses Metal 3.1 features available on M3, M3 Pro, M3 Max, M4, and later chips. M1/M2 support is planned. The engine uses custom Metal compute shaders for quantized matmul, attention, and activation operations, compiled ahead of time and dispatched directly to the GPU with zero allocations during inference.
macOS Actions
RCLI includes 43 macOS actions across categories:
- Productivity: create_note, create_reminder, run_shortcut
- Communication: send_message, facetime_call
- Media: play_on_spotify, play_apple_music, play_pause, next_track, set_music_volume
- System: open_app, quit_app, set_volume, toggle_dark_mode, screenshot, lock_screen
- Web: search_web, search_youtube, open_url, open_maps
📖 Read the full source: HN AI Agents
👀 See Also

Spectral: Capture App Traffic to Generate MCP Servers for OpenClaw Agents
Spectral is an open-source tool that captures traffic from any application, analyzes it with an LLM, and generates a working MCP server, allowing OpenClaw agents to call the app's real API directly instead of relying on browser automation.

SprintiQ: Open-Source Sprint Planning for Claude Code
SprintiQ is an open-source agile platform that acts as an orchestration layer for Claude Code, offering AI-powered user story generation, sprint planning, velocity tracking, and a CLI that syncs git activity to sprints in real time.

wearehere browser extension scans sites for tracking and privacy risks
wearehere is a browser extension that scans websites across ten categories including cookies, trackers, device fingerprinting, and dark patterns, then scores them based on privacy risks. It's under 200KB, runs locally in the browser, and also comes as an npm package for integration with AI agents via barebrowse MCP server.
Voker Launches Agent Analytics Platform with Intent/Correction/Resolution Primitives
YC S24 startup Voker launches an agent analytics platform with a lightweight SDK that automatically annotates user intents, corrections, and resolutions — providing self-service dashboards without relying on LLMs for data engineering.