Hypura: Storage-Tier-Aware LLM Inference on Apple Silicon

What Hypura does

Hypura is a storage-tier-aware LLM inference scheduler for Apple Silicon that places model tensors across GPU, RAM, and NVMe tiers based on access patterns, bandwidth costs, and hardware capabilities. This enables models that exceed physical memory to run without crashing the system.

Key features and how it works

Hypura reads GGUF files, profiles your hardware (GPU working set, RAM, NVMe bandwidth), and solves a placement optimization that assigns every tensor to a tier:

GPU (Metal): attention layers, norms, and embeddings
RAM: overflow layers that don't fit in the GPU working set, accessed via mmap
NVMe: remaining layers loaded on demand via direct I/O (F_NOCACHE + pread) and prefetched ahead of the forward pass
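The tier assignment above can be illustrated with a small greedy sketch. This is not Hypura's solver, which the source describes only as a placement optimization. The `Tensor` record, the `priority` score, and the budget numbers are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Tensor:
    name: str
    size_gb: float
    priority: int  # hypothetical score: lower means "keep closer to compute"


def place(tensors, gpu_budget_gb, ram_budget_gb):
    """Greedy placement: fill the GPU first, then RAM, and spill the rest to NVMe."""
    remaining = {"gpu": gpu_budget_gb, "ram": ram_budget_gb}
    placement = {}
    # Visit the highest-priority (lowest score) tensors first, smallest first on ties.
    for t in sorted(tensors, key=lambda t: (t.priority, t.size_gb)):
        for tier in ("gpu", "ram"):
            if t.size_gb <= remaining[tier]:
                remaining[tier] -= t.size_gb
                placement[t.name] = tier
                break
        else:
            # Neither the GPU nor RAM has room left, so the tensor is streamed.
            placement[t.name] = "nvme"
    return placement


tensors = [
    Tensor("attn.0", 1.0, 0),
    Tensor("norm.0", 0.1, 0),
    Tensor("ffn.0", 4.0, 1),
    Tensor("ffn.1", 4.0, 1),
    Tensor("ffn.2", 4.0, 1),
]
placement = place(tensors, gpu_budget_gb=2.0, ram_budget_gb=8.0)
print(placement)
```

With these made-up budgets, the attention and norm tensors land on the GPU, two FFN tensors fit in RAM, and the third spills to NVMe. A real solver would presumably also weigh bandwidth costs per tier, as the description above suggests.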

For MoE models like Mixtral, Hypura implements expert streaming: only non-expert tensors (~1 GB) stay on the GPU, while expert tensors stream from NVMe through a pool buffer on demand. Three mechanisms support this: a neuron cache with a 99.5% hit rate that eliminates most I/O after warmup, router interception to identify the selected experts, and co-activation tracking that predicts which experts will fire next so they can be prefetched speculatively.
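To make the caching and prediction ideas concrete, here is a minimal sketch of an LRU expert cache that also counts which experts tend to follow which. It is an assumption about how such a mechanism could look, not Hypura's implementation. `ExpertCache`, `load_fn`, `predict_next`, and the token-to-token counting scheme are hypothetical names and choices.

```python
from collections import Counter, OrderedDict, defaultdict


class ExpertCache:
    """LRU cache of expert weights plus 'what fired next' counts for prefetching."""

    def __init__(self, capacity, load_fn):
        self.capacity = capacity
        self.load_fn = load_fn               # hypothetical loader, e.g. an NVMe read
        self.cache = OrderedDict()           # expert id -> weights, oldest first
        self.follows = defaultdict(Counter)  # expert -> counts of experts seen next
        self.previous = []                   # experts selected for the previous token
        self.hits = 0
        self.misses = 0

    def _insert(self, expert_id):
        self.cache[expert_id] = self.load_fn(expert_id)
        while len(self.cache) > self.capacity:
            self.cache.popitem(last=False)   # evict the least recently used expert

    def fetch(self, selected):
        """Return weights for the router-selected experts and update statistics."""
        out = {}
        for e in selected:
            if e in self.cache:
                self.hits += 1
                self.cache.move_to_end(e)    # mark as most recently used
            else:
                self.misses += 1
                self._insert(e)
            out[e] = self.cache[e]
        # Record that the current experts followed the previous token's experts.
        for prev in self.previous:
            for cur in selected:
                self.follows[prev][cur] += 1
        self.previous = list(selected)
        return out

    def predict_next(self, k=2):
        """Guess the k experts most likely to be selected for the next token."""
        votes = Counter()
        for e in self.previous:
            votes.update(self.follows[e])
        return [e for e, _ in votes.most_common(k)]

    def prefetch(self, k=2):
        """Speculatively load predicted experts that are not cached yet."""
        for e in self.predict_next(k):
            if e not in self.cache:
                self._insert(e)


loads = []


def fake_load(expert_id):
    loads.append(expert_id)                  # stands in for an NVMe read
    return f"weights-{expert_id}"


cache = ExpertCache(capacity=4, load_fn=fake_load)
for selected in [(0, 1), (2, 3), (0, 1), (2, 3), (0, 1)]:
    cache.fetch(selected)
    cache.prefetch()
print(cache.hits, cache.misses, loads)
```

In this toy trace the four experts are each loaded once, and every later selection is served from the cache. The tuples passed to `fetch` stand in for the expert ids that router interception would report.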

For dense models like Llama 70B, Hypura uses dense FFN streaming: attention and norm tensors stay on the GPU (~8 GB), while FFN tensors (~32 GB) stream from NVMe through a dynamically sized pool buffer with scaled prefetch lookahead.
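The streaming pattern can be sketched as a generator that reads layer byte ranges with `os.pread` while keeping a few reads in flight ahead of the consumer. This is an illustrative sketch under several assumptions: `stream_layers`, the `layer_ranges` list, and the fixed `lookahead` are hypothetical, whereas Hypura is described as sizing its pool and lookahead dynamically. `F_NOCACHE` is only set where the platform's `fcntl` module defines it (macOS).

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def stream_layers(path, layer_ranges, lookahead=2):
    """Yield (index, bytes) per layer while reading up to `lookahead` layers ahead."""
    fd = os.open(path, os.O_RDONLY)
    try:
        try:
            import fcntl
            # On macOS, F_NOCACHE asks the kernel to bypass the page cache.
            # The constant is absent on other platforms, so the call is guarded.
            if hasattr(fcntl, "F_NOCACHE"):
                fcntl.fcntl(fd, fcntl.F_NOCACHE, 1)
        except ImportError:
            pass

        with ThreadPoolExecutor(max_workers=max(1, lookahead)) as pool:
            pending = {}  # layer index -> future; at most lookahead + 1 entries

            def schedule(i):
                if i < len(layer_ranges) and i not in pending:
                    offset, length = layer_ranges[i]
                    # pread takes an explicit offset, so concurrent reads are safe.
                    pending[i] = pool.submit(os.pread, fd, length, offset)

            for i in range(len(layer_ranges)):
                for j in range(i, i + lookahead + 1):
                    schedule(j)
                # Block only if the prefetch for layer i has not finished yet.
                yield i, pending.pop(i).result()
    finally:
        os.close(fd)


# Tiny stand-in for a model file: three 4-byte "layers".
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"A" * 4 + b"B" * 4 + b"C" * 4)
    path = f.name

ranges = [(0, 4), (4, 4), (8, 4)]
layers = list(stream_layers(path, ranges, lookahead=1))
os.remove(path)
print(layers)
```

A production reader would also need to handle short reads, since `os.pread` may return fewer bytes than requested, and would reuse preallocated buffers rather than allocating new `bytes` objects per layer.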

Performance benchmarks

All benchmarks were run on an M1 Max with 32 GB of unified memory and ~5.1 GB/s NVMe sequential read:

Qwen 2.5 14B Q4_K_M (8.4 GB): Full-resident mode, 21 tok/s (same as llama.cpp)
Mixtral 8x7B Q5_K_M (30.9 GB): Expert-streaming mode, 2.2 tok/s (llama.cpp OOM)
Llama 3.3 70B Q4_K_M (39.6 GB): Dense-FFN-streaming mode, 0.3 tok/s (llama.cpp OOM)
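A back-of-envelope check helps put the streaming numbers in context. The sketch below uses only figures quoted in this article (~5.1 GB/s NVMe read, ~32 GB of FFN tensors, 0.3 tok/s) and assumes, as a simplification, that NVMe bandwidth is the only bottleneck. The function name is hypothetical.

```python
def streaming_bound_tok_per_s(bandwidth_gb_s, streamed_gb_per_token):
    """Upper bound on tokens/s if each token must read this many GB from NVMe."""
    return bandwidth_gb_s / streamed_gb_per_token


# Worst case: every FFN byte (~32 GB) is re-read from NVMe for each token.
worst_case = streaming_bound_tok_per_s(5.1, 32.0)

# Read volume per token implied by the measured 0.3 tok/s under the same assumption.
implied_gb_per_token = 5.1 / 0.3

print(f"{worst_case:.2f} tok/s, {implied_gb_per_token:.0f} GB/token")
```

The worst-case bound comes out near 0.16 tok/s, below the reported 0.3 tok/s. One possible reading is that roughly half of the FFN data is served from memory on each token rather than re-read from NVMe, but the source does not state this, so treat it as an inference from the quoted figures.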

Pool buffer size, prefetch depth, and memory budgets are computed automatically from your hardware profile, so no manual tuning is needed.

Installation

Hypura builds from source with Cargo. You'll need Rust 1.75+ and CMake.

📖 Read the full source: HN AI Agents
