Building On-Device AI in React Native: Tips & Benchmarks

Text Generation with LLMs

Use llama.rn to run GGUF models in React Native. It wraps llama.cpp and provides native bindings for Android (via JNI) and iOS (with Metal for GPU acceleration). Streaming tokens via callbacks works well.

Memory management is critical. As a rule of thumb, estimate runtime RAM as file size × 1.5 to cover the KV cache and activations, so a 7B Q4 model needs ~5.5GB. Treat 60% of device RAM as a hard budget: warn when the estimate reaches 50% and block loading at 60% to keep the OS from killing the app.
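
A minimal sketch of that budget check, using the × 1.5 multiplier and the 50%/60% thresholds from the paragraph above. The function and type names are hypothetical, and how you obtain total device RAM (for example, from a device-info module) is left out on purpose.

```typescript
// Hypothetical helper names; only the multiplier and thresholds come from the text.
type BudgetVerdict = "ok" | "warn" | "block";

const RUNTIME_OVERHEAD = 1.5; // file size x 1.5 covers KV cache and activations
const WARN_FRACTION = 0.5;    // warn at 50% of device RAM
const BLOCK_FRACTION = 0.6;   // block at 60% of device RAM

// Rough runtime footprint of a GGUF model, in bytes.
function estimateRuntimeBytes(modelFileBytes: number): number {
  return modelFileBytes * RUNTIME_OVERHEAD;
}

// Compares the estimated footprint against the device RAM budget.
function checkMemoryBudget(modelFileBytes: number, deviceRamBytes: number): BudgetVerdict {
  const needed = estimateRuntimeBytes(modelFileBytes);
  if (needed >= deviceRamBytes * BLOCK_FRACTION) return "block";
  if (needed >= deviceRamBytes * WARN_FRACTION) return "warn";
  return "ok";
}
```

With these numbers, a 3.7GB model file (about 5.5GB at runtime) is blocked on an 8GB phone, triggers a warning on a 10GB phone, and loads without complaint on a 12GB phone.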

GPU acceleration uses OpenCL on Android (Adreno GPUs) and Metal on iOS. On Android, flash attention crashes when GPU layers > 0, so enforce in code that the two are never enabled together rather than relying on users to pick a safe combination. KV cache quantization (f16/q8_0/q4_0) is a bigger win than GPU for most devices; going from f16 to q4_0 roughly tripled inference speed in testing.
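
One way to enforce the Android constraint is to pass every requested configuration through a small guard before the model loads. The sketch below is an assumption about how such a guard could look: the option names are hypothetical (they are not llama.rn parameter names), and which of the two features to drop on conflict is a policy choice the text does not specify, so it is exposed as a parameter.

```typescript
// Hypothetical option shape; map these fields onto your loader's real parameters.
type Platform = "android" | "ios";
type KvCacheType = "f16" | "q8_0" | "q4_0";

interface LoadOptions {
  nGpuLayers: number;
  flashAttention: boolean;
  kvCacheType: KvCacheType;
}

// Returns a copy of the options in which flash attention and GPU layers
// are never both enabled on Android. `prefer` decides which one survives.
function sanitizeLoadOptions(
  platform: Platform,
  requested: LoadOptions,
  prefer: "gpu" | "flashAttention" = "gpu",
): LoadOptions {
  const safe: LoadOptions = { ...requested };
  const conflict = platform === "android" && safe.nGpuLayers > 0 && safe.flashAttention;
  if (conflict) {
    if (prefer === "gpu") {
      safe.flashAttention = false; // keep GPU offload, drop flash attention
    } else {
      safe.nGpuLayers = 0;         // keep flash attention, run on CPU
    }
  }
  return safe;
}
```

The input object is left untouched, so the UI can still show what the user asked for alongside what was actually applied.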

Image Generation with Stable Diffusion

Image generation is platform-specific; no single library covers both Android and iOS.

Android: Use MNN (Alibaba's framework, CPU, works on all ARM64 devices) as the baseline and QNN (Qualcomm AI Engine, NPU-accelerated, Snapdragon 8 Gen 1+ only) where it is available. QNN is up to 3× faster but only works on recent Qualcomm chips, so detect support at runtime and fall back to MNN automatically.
iOS: Use Apple's ml-stable-diffusion pipeline with Core ML and Neural Engine acceleration. Palettized models (~1GB, 6-bit) are great for memory-constrained devices; full precision (~4GB, fp16) is faster on ANE but needs headroom.
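
The runtime detection with fallback on Android can be reduced to an ordered list of probes where the first supported backend wins. This is only a sketch: the `BackendProbe` shape and the `isSupported` callbacks are hypothetical stand-ins for whatever native check your QNN and MNN bridges expose.

```typescript
// Hypothetical probe interface; the real capability check lives in native code.
interface BackendProbe {
  name: string;
  isSupported: () => boolean;
}

// Walks the probes in preference order and returns the first usable backend,
// or null if none is available. A probe that throws counts as unsupported.
function pickBackend(probes: BackendProbe[]): string | null {
  for (const probe of probes) {
    try {
      if (probe.isSupported()) return probe.name;
    } catch {
      // Treat a failing probe like an unsupported backend and keep going.
    }
  }
  return null;
}
```

Listing QNN first and MNN second gives the NPU path on supported Snapdragon chips and the CPU path everywhere else, without the caller needing to know which SoC it is running on.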

Real-world benchmarks for a 512×512 image at 20 steps: 5–10 seconds on a Snapdragon NPU, about 15 seconds on a flagship CPU, and 8–15 seconds on the iOS ANE. Show a real-time preview every N denoising steps so users don't think the app froze.
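
As a small illustration of the preview cadence, the helper below lists the denoising steps after which a preview would be shown. The function name is hypothetical, and it assumes the final step is not previewed because the finished image is delivered at that point anyway.

```typescript
// Returns the 1-based step numbers after which to show an intermediate preview.
// The last step is excluded: the finished image replaces the preview there.
function previewSteps(totalSteps: number, everyN: number): number[] {
  const valid =
    Math.floor(totalSteps) === totalSteps &&
    Math.floor(everyN) === everyN &&
    totalSteps >= 1 &&
    everyN >= 1;
  if (!valid) {
    throw new RangeError("totalSteps and everyN must be positive integers");
  }
  const steps: number[] = [];
  for (let step = everyN; step < totalSteps; step += everyN) {
    steps.push(step);
  }
  return steps;
}
```

For the 20-step runs benchmarked above, `previewSteps(20, 5)` yields steps 5, 10, and 15, which at roughly 15 seconds total means the user sees something change every few seconds.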

Voice Transcription with Whisper

whisper.rn wraps whisper.cpp and is straightforward to integrate. Offer multiple model sizes (Tiny/Base/Small) and let users pick speed vs. accuracy tradeoffs. Real-time partial transcription (words appearing as they speak) makes it feel native.

Buffer audio in native code and clear it after transcription; don't write audio files to disk if privacy matters.
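
The buffering itself belongs in native code, as noted above, but the pattern is easy to show in TypeScript: hold samples in memory only, hand them to the transcriber once, and wipe them whether or not transcription succeeds. The class below is a hypothetical illustration of that pattern, not part of whisper.rn.

```typescript
// In-memory audio buffer that never touches disk and wipes itself after use.
class EphemeralAudioBuffer {
  private chunks: Float32Array[] = [];

  // Copies the samples so the caller can safely reuse its own buffer.
  append(samples: Float32Array): void {
    this.chunks.push(samples.slice());
  }

  get sampleCount(): number {
    return this.chunks.reduce((total, chunk) => total + chunk.length, 0);
  }

  // Runs `transcribe` over all buffered audio, then zeroes every copy.
  // `transcribe` must be synchronous here; an async transcriber would need
  // the wipe moved to after its promise settles.
  consume<T>(transcribe: (audio: Float32Array) => T): T {
    const audio = new Float32Array(this.sampleCount);
    let offset = 0;
    for (const chunk of this.chunks) {
      audio.set(chunk, offset);
      offset += chunk.length;
    }
    try {
      return transcribe(audio);
    } finally {
      audio.fill(0);
      this.clear();
    }
  }

  // Overwrites buffered samples before dropping the references.
  clear(): void {
    for (const chunk of this.chunks) chunk.fill(0);
    this.chunks = [];
  }
}
```

Zeroing before release is a belt-and-braces step: garbage collection alone frees the memory eventually but gives no guarantee about when the samples stop being readable.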

Vision with Multimodal Models

Vision models need two files: the main GGUF and an mmproj (multimodal projector) companion. Handle this transparently: auto-detect vision models, auto-download the mmproj, track them as a single unit, and search the model directory at runtime if the link breaks. Download both files in parallel to cut download time nearly in half for a 2B vision model.
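
The "search the model directory if the link breaks" step might look like the sketch below. It assumes projector files are GGUFs with "mmproj" in the filename, which is a common naming convention but not something the text guarantees; the entry shape and function names are hypothetical.

```typescript
// Hypothetical record pairing a vision model with its projector file.
interface VisionModelEntry {
  modelFile: string;
  mmprojFile: string | null;
}

// Length of the shared leading substring, ignoring case.
function commonPrefixLength(a: string, b: string): number {
  const x = a.toLowerCase();
  const y = b.toLowerCase();
  let i = 0;
  while (i < x.length && i < y.length && x.charAt(i) === y.charAt(i)) i++;
  return i;
}

// Returns the projector to use: the stored link if it still exists,
// otherwise the best-matching mmproj file found in the directory listing.
function resolveMmproj(entry: VisionModelEntry, filesInDir: string[]): string | null {
  if (entry.mmprojFile !== null && filesInDir.indexOf(entry.mmprojFile) !== -1) {
    return entry.mmprojFile;
  }
  // Assumption: projector filenames contain "mmproj" and end in ".gguf".
  const candidates = filesInDir.filter((file) => /mmproj.*\.gguf$/i.test(file));
  if (candidates.length === 0) return null;
  // With several projectors present, prefer the name closest to the model's.
  return candidates.reduce((best, file) =>
    commonPrefixLength(file, entry.modelFile) > commonPrefixLength(best, entry.modelFile)
      ? file
      : best,
  );
}
```

A `null` result is the signal to re-download the projector instead of loading the model in a text-only state the user did not ask for.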

SmolVLM at 500M is the sweet spot for mobile: it responds in ~7 seconds on flagship devices and is capable enough for document reading and scene description.

Tool Calling for On-Device Agent Loops

Models that support function calling can use tools (web search, calculator, date/time, device info) through an automatic loop: the LLM generates, the app parses the output for tool calls, executes them, and injects the results back into context, then the LLM continues. Cap the loop at 3 iterations and 5 total tool calls to prevent infinite loops.
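
The loop and its two caps can be sketched as follows. Everything here is illustrative: `generate` and `parseToolCalls` are injected so the sketch stays independent of any particular runtime, `generate` is synchronous only to keep the example short (a real completion call is asynchronous), and finishing with one last plain generation once the caps are hit is an assumption about what "capped" should mean.

```typescript
// Hypothetical message, call, and tool shapes for the sketch.
type Message = { role: "system" | "user" | "assistant" | "tool"; content: string };
interface ToolCall { name: string; args: Record<string, unknown>; }
type Tool = (args: Record<string, unknown>) => string;

const MAX_ITERATIONS = 3;  // tool rounds per user turn
const MAX_TOTAL_CALLS = 5; // tool executions per user turn

function runToolLoop(
  messages: Message[],
  generate: (context: Message[]) => string,
  parseToolCalls: (output: string) => ToolCall[],
  tools: Record<string, Tool>,
): string {
  const context = messages.slice(); // never mutate the caller's history
  let totalCalls = 0;

  for (let iteration = 0; iteration < MAX_ITERATIONS; iteration++) {
    const output = generate(context);
    const calls = parseToolCalls(output);
    if (calls.length === 0) return output; // plain answer, no tools requested

    context.push({ role: "assistant", content: output });
    for (const call of calls) {
      if (totalCalls >= MAX_TOTAL_CALLS) break;
      totalCalls++;
      let result: string;
      if (!Object.prototype.hasOwnProperty.call(tools, call.name)) {
        result = `Unknown tool: ${call.name}`;
      } else {
        try {
          result = tools[call.name](call.args);
        } catch (err) {
          result = `Tool ${call.name} failed: ${String(err)}`;
        }
      }
      // Inject the result so the next generation can use it.
      context.push({ role: "tool", content: result });
    }
  }

  // Caps reached: take whatever the model says next as the final answer.
  return generate(context);
}
```

Feeding unknown-tool and failure messages back into context, instead of throwing, lets the model recover or apologize within the same turn.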

Support two parsing paths: larger models output structured JSON tool calls natively through llama.rn, while smaller models output XML-style tags such as <tool_call>. Detect tool support at model load time by inspecting the Jinja chat template; if the model doesn't support tools, don't inject tool definitions into the system prompt, since that invites hallucinated calls. The calculator uses a recursive descent parser and never calls eval().
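
For reference, a recursive descent calculator fits in a few dozen lines. The sketch below is one possible version, not the implementation described above: it assumes the tool only needs +, -, *, /, unary signs, parentheses, and decimal numbers, and it rejects everything else instead of guessing.

```typescript
// Grammar (assumed for this sketch):
//   expr   := term (("+" | "-") term)*
//   term   := factor (("*" | "/") factor)*
//   factor := ("+" | "-") factor | number | "(" expr ")"
function evaluate(input: string): number {
  let pos = 0;

  // Skips spaces, then returns the next character ("" at end of input).
  function peek(): string {
    while (input.charAt(pos) === " ") pos++;
    return input.charAt(pos);
  }

  function parseExpr(): number {
    let value = parseTerm();
    while (peek() === "+" || peek() === "-") {
      const op = input.charAt(pos++);
      const rhs = parseTerm();
      value = op === "+" ? value + rhs : value - rhs;
    }
    return value;
  }

  function parseTerm(): number {
    let value = parseFactor();
    while (peek() === "*" || peek() === "/") {
      const op = input.charAt(pos++);
      const rhs = parseFactor();
      if (op === "/" && rhs === 0) throw new Error("Division by zero");
      value = op === "*" ? value * rhs : value / rhs;
    }
    return value;
  }

  function parseFactor(): number {
    const ch = peek();
    if (ch === "-") { pos++; return -parseFactor(); }
    if (ch === "+") { pos++; return parseFactor(); }
    if (ch === "(") {
      pos++;
      const value = parseExpr();
      if (peek() !== ")") throw new Error(`Expected ")" at position ${pos}`);
      pos++;
      return value;
    }
    const match = /^\d+(\.\d+)?/.exec(input.slice(pos));
    if (!match) throw new Error(`Unexpected input at position ${pos}`);
    pos += match[0].length;
    return parseFloat(match[0]);
  }

  const result = parseExpr();
  if (peek() !== "") throw new Error(`Unexpected input at position ${pos}`);
  return result;
}
```

Because the parser only recognizes digits, operators, and parentheses, a model-produced string such as `alert(1)` fails at the first character instead of being executed, which is the whole point of avoiding eval().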