Practical Lessons from Building On-Device AI in React Native

Text Generation with LLMs
Use llama.rn for running GGUF models in React Native. It wraps llama.cpp and provides native bindings for Android (JNI) and iOS (Metal). Streaming tokens via callbacks works well.
Memory management is critical. Estimate runtime memory as file size × 1.5 to cover the KV cache and activations; by that rule a 7B Q4 model needs ~5.5GB of RAM. Treat 60% of device RAM as a hard budget: warn when the estimate passes 50% and block the load at 60%, so the OS does not kill the app.
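The budget rule above can be sketched as a small pure function. The 1.5 multiplier and the 50%/60% thresholds come from the text; the function and type names are hypothetical, and the caller is assumed to obtain total device RAM from whatever device-info source the app already uses.

```typescript
// Hypothetical helper names; the multiplier and thresholds are the ones quoted above.
type LoadDecision = "ok" | "warn" | "block";

const RUNTIME_MULTIPLIER = 1.5; // file size x 1.5 covers KV cache and activations
const WARN_FRACTION = 0.5; // warn above 50% of device RAM
const BLOCK_FRACTION = 0.6; // refuse to load above 60% of device RAM

function estimateRuntimeBytes(modelFileBytes: number): number {
  return modelFileBytes * RUNTIME_MULTIPLIER;
}

function checkMemoryBudget(
  modelFileBytes: number,
  deviceRamBytes: number
): LoadDecision {
  const needed = estimateRuntimeBytes(modelFileBytes);
  if (needed > deviceRamBytes * BLOCK_FRACTION) return "block";
  if (needed > deviceRamBytes * WARN_FRACTION) return "warn";
  return "ok";
}
```

Running the check before the model file is opened, rather than after a failed allocation, gives the app a chance to suggest a smaller model instead of being killed mid-load.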
GPU acceleration uses OpenCL on Android (Adreno GPUs) and Metal on iOS. Flash attention crashes with GPU layers > 0 on Android, so disable flash attention in code whenever any layers are offloaded to the GPU there. KV cache quantization (f16/q8_0/q4_0) is a bigger win than GPU for most devices; going from f16 to q4_0 roughly tripled inference speed in testing.
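One way to enforce the flash attention rule is to sanitize the requested settings before they reach the inference library. This is a minimal sketch: the `InferenceConfig` shape and its field names are hypothetical and are not llama.rn's actual parameter names, so they would need to be mapped onto the real options.

```typescript
// Hypothetical config shape for illustration; map these fields to the real library options.
type KvCacheType = "f16" | "q8_0" | "q4_0";

interface InferenceConfig {
  gpuLayers: number;
  flashAttention: boolean;
  kvCacheType: KvCacheType;
}

function sanitizeConfig(
  platform: "android" | "ios",
  requested: InferenceConfig
): InferenceConfig {
  // Clamp to a non-negative whole number of offloaded layers.
  const gpuLayers = Math.max(0, Math.floor(requested.gpuLayers));
  // On Android, flash attention plus any GPU offload is the crashing combination.
  const flashAttention =
    platform === "android" && gpuLayers > 0 ? false : requested.flashAttention;
  return { ...requested, gpuLayers, flashAttention };
}
```

Applying the rule in one place means a settings screen can still let users toggle both options without being able to produce the crashing combination.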
Image Generation with Stable Diffusion
Image generation is platform-specific; no single library covers both Android and iOS.
- Android: Use MNN (Alibaba's framework, CPU, works on all ARM64 devices) and QNN (Qualcomm AI Engine, NPU-accelerated, Snapdragon 8 Gen 1+ only). QNN is 3× faster but only works on recent Qualcomm chips. Implement runtime detection with automatic fallback.
- iOS: Use Apple's ml-stable-diffusion pipeline with Core ML and Neural Engine acceleration. Palettized models (~1GB, 6-bit) are great for memory-constrained devices; full precision (~4GB, fp16) is faster on ANE but needs headroom.
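The Android detection-with-fallback idea can be sketched as follows. Everything here is an assumption for illustration: `isQnnSupported` stands in for whatever native check the app exposes, and `init` stands in for the real backend initialization, which would be asynchronous in practice.

```typescript
// Hypothetical interfaces; the real probe and init calls would live in native modules.
type AndroidBackend = "qnn" | "mnn";

interface BackendProbe {
  isQnnSupported(): boolean; // hypothetical native capability check
}

function selectAndroidBackend(probe: BackendProbe): AndroidBackend {
  try {
    return probe.isQnnSupported() ? "qnn" : "mnn";
  } catch (_err) {
    // If detection itself fails, the CPU path works on all ARM64 devices.
    return "mnn";
  }
}

function initWithFallback<T>(
  preferred: AndroidBackend,
  init: (backend: AndroidBackend) => T
): { backend: AndroidBackend; engine: T } {
  if (preferred === "qnn") {
    try {
      return { backend: "qnn", engine: init("qnn") };
    } catch (_err) {
      // NPU init failed at runtime; fall through to the CPU backend.
    }
  }
  return { backend: "mnn", engine: init("mnn") };
}
```

Separating detection from initialization covers both failure modes: a chip that is not supported at all, and a supported chip where the NPU path still fails to start.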
Real-world benchmarks for 512×512 at 20 steps: 5–10 seconds on a Snapdragon NPU, 15 seconds on a flagship CPU, and 8–15 seconds on the iOS ANE. Show a live preview every N denoising steps so users don't think the app froze.
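The preview cadence can be expressed as a small helper. This sketch assumes 1-indexed step numbers and also emits a preview on the final step; both choices, and the function names, are illustrative rather than taken from the source.

```typescript
// Assumes steps are numbered 1..totalSteps; names are hypothetical.
function shouldEmitPreview(
  step: number,
  totalSteps: number,
  everyN: number
): boolean {
  if (everyN <= 0) return false; // previews disabled
  return step % everyN === 0 || step === totalSteps;
}

// Lists the steps at which a preview would be shown for a given run.
function previewSteps(totalSteps: number, everyN: number): number[] {
  const steps: number[] = [];
  for (let step = 1; step <= totalSteps; step += 1) {
    if (shouldEmitPreview(step, totalSteps, everyN)) steps.push(step);
  }
  return steps;
}
```

A larger N keeps decoding overhead low on slow devices, while always previewing the last step avoids a visible jump between the final preview and the finished image.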
Voice Transcription with Whisper
whisper.rn wraps whisper.cpp and is straightforward to integrate. Offer multiple model sizes (Tiny/Base/Small) and let users pick speed vs. accuracy tradeoffs. Real-time partial transcription (words appearing as they speak) makes it feel native.
Buffer audio in native code and clear it after transcription; don't write audio files to disk if privacy matters.
Vision with Multimodal Models
Vision models need two files: the main GGUF and an mmproj (multimodal projector) companion. Handle this transparently: auto-detect vision models, auto-download the mmproj, track them as a single unit, and search the model directory at runtime if the link breaks. Download both files in parallel to cut download time nearly in half for a 2B vision model.
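Recovering a broken model-to-mmproj link might look like the sketch below. The matching heuristic is an assumption: it treats any `.gguf` file with `mmproj` in its name as a projector and prefers one whose name contains the model's file name with the last dash-separated segment (typically the quantization tag) removed. The entry shape and function names are hypothetical.

```typescript
// Hypothetical record tying a vision model to its projector file.
interface VisionModelEntry {
  modelFile: string;
  mmprojFile: string | null;
}

function isMmprojFile(fileName: string): boolean {
  const lower = fileName.toLowerCase();
  return lower.includes("mmproj") && lower.endsWith(".gguf");
}

// Drops ".gguf" and the last "-" segment, e.g. "foo-2b-q4_0.gguf" -> "foo-2b".
function modelStem(fileName: string): string {
  const base = fileName.toLowerCase().replace(/\.gguf$/, "");
  const cut = base.lastIndexOf("-");
  return cut === -1 ? base : base.slice(0, cut);
}

function resolveMmproj(
  entry: VisionModelEntry,
  directoryFiles: string[]
): string | null {
  // Happy path: the stored link still points at a file that exists.
  if (entry.mmprojFile !== null && directoryFiles.includes(entry.mmprojFile)) {
    return entry.mmprojFile;
  }
  // Link is broken: search the model directory for a matching projector.
  const candidates = directoryFiles.filter(isMmprojFile);
  const stem = modelStem(entry.modelFile);
  const byName = candidates.find((f) => f.toLowerCase().includes(stem));
  if (byName !== undefined) return byName;
  // Only guess when there is exactly one projector to choose from.
  return candidates.length === 1 ? candidates[0] : null;
}
```

Returning `null` when the match is ambiguous lets the app trigger a fresh mmproj download instead of pairing a model with the wrong projector.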
SmolVLM at 500M is the sweet spot for mobile: it takes ~7 seconds on flagship devices and is capable enough for document reading and scene description.
Tool Calling for On-Device Agent Loops
Models that support function calling can use tools (web search, calculator, date/time, device info) through an automatic loop: the LLM generates, the output is parsed for tool calls, the calls are executed, the results are injected back into context, and the LLM continues. Cap the loop at 3 iterations and 5 total tool calls to prevent infinite loops.
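The loop and its two caps can be sketched as below. To keep it short, generation and tool execution are modeled as synchronous callbacks supplied by the caller, whereas real on-device generation would be asynchronous; the types, the context format, and the function names are all hypothetical.

```typescript
// Hypothetical types; generate() is assumed to return already-parsed tool calls.
interface ToolCall {
  name: string;
  args: Record<string, unknown>;
}
interface ModelTurn {
  text: string;
  toolCalls: ToolCall[];
}

const MAX_ITERATIONS = 3; // cap from the text
const MAX_TOTAL_CALLS = 5; // cap from the text

function runToolLoop(
  prompt: string,
  generate: (context: string[]) => ModelTurn,
  execute: (call: ToolCall) => string
): { answer: string; iterations: number; callsMade: number } {
  const context: string[] = [prompt];
  let iterations = 0;
  let callsMade = 0;
  let turn = generate(context);

  while (
    turn.toolCalls.length > 0 &&
    iterations < MAX_ITERATIONS &&
    callsMade < MAX_TOTAL_CALLS
  ) {
    iterations += 1;
    context.push(turn.text);
    for (const call of turn.toolCalls) {
      if (callsMade >= MAX_TOTAL_CALLS) break; // drop calls beyond the budget
      callsMade += 1;
      // Inject each tool result back into the context for the next turn.
      context.push(`[tool:${call.name}] ${execute(call)}`);
    }
    turn = generate(context);
  }
  // If a cap was hit, this is simply the model's last output.
  return { answer: turn.text, iterations, callsMade };
}
```

Counting iterations and individual calls separately matters because a single turn can request several tools at once; either limit alone would leave one path to runaway execution.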
Support two parsing paths: larger models output structured JSON tool calls natively through llama.rn, while smaller models output XML like <tool_call>. Detect tool support at model load time by inspecting the jinja chat template; if the model doesn't support tools, don't inject tool definitions into the system prompt, since that leads to hallucinated tool calls. The calculator uses a recursive descent parser and never calls eval().
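For the calculator tool, a recursive descent evaluator along these lines avoids eval() entirely. This is a minimal sketch with an assumed grammar (the four basic operators, unary signs, parentheses, and decimal numbers); the source does not say which operations the original calculator supports.

```typescript
// Assumed grammar:
//   expr   := term (("+" | "-") term)*
//   term   := factor (("*" | "/") factor)*
//   factor := ("+" | "-") factor | number | "(" expr ")"
function evaluateExpression(input: string): number {
  let pos = 0;

  function skipSpaces(): void {
    while (input.charAt(pos) === " ") pos += 1;
  }

  function parseNumber(): number {
    const start = pos;
    while (/[0-9.]/.test(input.charAt(pos))) pos += 1;
    const text = input.slice(start, pos);
    const value = Number(text);
    if (text === "" || Number.isNaN(value)) {
      throw new Error(`Expected a number at position ${start}`);
    }
    return value;
  }

  function parseFactor(): number {
    skipSpaces();
    const ch = input.charAt(pos);
    if (ch === "-") { pos += 1; return -parseFactor(); }
    if (ch === "+") { pos += 1; return parseFactor(); }
    if (ch === "(") {
      pos += 1;
      const value = parseExpr();
      skipSpaces();
      if (input.charAt(pos) !== ")") throw new Error(`Expected ")" at position ${pos}`);
      pos += 1;
      return value;
    }
    return parseNumber();
  }

  function parseTerm(): number {
    let value = parseFactor();
    for (;;) {
      skipSpaces();
      const op = input.charAt(pos);
      if (op !== "*" && op !== "/") return value;
      pos += 1;
      const rhs = parseFactor();
      value = op === "*" ? value * rhs : value / rhs;
    }
  }

  function parseExpr(): number {
    let value = parseTerm();
    for (;;) {
      skipSpaces();
      const op = input.charAt(pos);
      if (op !== "+" && op !== "-") return value;
      pos += 1;
      const rhs = parseTerm();
      value = op === "+" ? value + rhs : value - rhs;
    }
  }

  const result = parseExpr();
  skipSpaces();
  if (pos < input.length) throw new Error(`Unexpected input at position ${pos}`);
  return result;
}
```

Because the parser only accepts digits, operators, and parentheses, model output such as a function call is rejected with an error instead of being executed.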
Intent Classification
If your app does both text and image generation, you need to decide what the user wants based on input analysis.
📖 Read the full source: r/LocalLLaMA