Hypura: Storage-tier-aware LLM inference scheduler for Apple Silicon

✍️ OpenClawRadar | 📅 Published: March 24, 2026

What Hypura does

Hypura is a storage-tier-aware LLM inference scheduler for Apple Silicon that places model tensors across GPU, RAM, and NVMe tiers based on access patterns, bandwidth costs, and hardware capabilities. This enables models that exceed physical memory to run without crashing the system.

Key features and how it works

Hypura reads GGUF files, profiles your hardware (GPU working set, RAM, NVMe bandwidth), and solves a placement optimization problem that assigns every tensor to a tier:

  • GPU (Metal) — Attention layers, norms, embeddings
  • RAM — Overflow layers that don't fit in the GPU working set, accessed via mmap
  • NVMe — Remaining layers loaded on-demand via direct I/O (F_NOCACHE + pread), prefetched ahead of the forward pass
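
The tiering above can be illustrated with a small greedy sketch. This is not Hypura's actual solver, whose objective and constraints the source does not describe; the `Tensor` type, the `place` function, the tensor names, sizes, and budgets are all assumptions made for illustration. It only shows the idea of filling the fastest tier first with the tensor kinds the article lists for the GPU, then spilling to RAM and finally to NVMe.

```python
from dataclasses import dataclass

@dataclass
class Tensor:
    name: str
    size_gb: float
    kind: str  # e.g. "attention", "norm", "embedding", "ffn"

# Tensor kinds the article lists for the GPU tier.
GPU_PREFERRED = {"attention", "norm", "embedding"}

def place(tensors, gpu_budget_gb, ram_budget_gb):
    """Greedy placement (hypothetical): try GPU, then RAM, else NVMe."""
    placement = {}
    gpu_free, ram_free = gpu_budget_gb, ram_budget_gb
    # Visit GPU-preferred kinds first; sorted() is stable, so the
    # original model order is kept within each group.
    for t in sorted(tensors, key=lambda t: t.kind not in GPU_PREFERRED):
        if t.size_gb <= gpu_free:
            placement[t.name] = "gpu"
            gpu_free -= t.size_gb
        elif t.size_gb <= ram_free:
            placement[t.name] = "ram"
            ram_free -= t.size_gb
        else:
            placement[t.name] = "nvme"
    return placement

# Made-up tensors and budgets, chosen only to exercise all three tiers.
tensors = [
    Tensor("blk.0.attn", 0.5, "attention"),
    Tensor("blk.0.ffn", 2.0, "ffn"),
    Tensor("blk.1.attn", 0.5, "attention"),
    Tensor("blk.1.ffn", 2.0, "ffn"),
    Tensor("blk.2.ffn", 2.0, "ffn"),
]
plan = place(tensors, gpu_budget_gb=1.0, ram_budget_gb=4.0)
for name, tier in plan.items():
    print(name, "->", tier)
```

A real scheduler would also weigh access frequency and per-tier bandwidth, as the article describes, rather than tensor kind and size alone.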

For MoE models like Mixtral, Hypura implements expert-streaming: only non-expert tensors (~1 GB) stay on the GPU, while expert tensors stream from NVMe through a pool buffer on demand. It includes a neuron cache with a 99.5% hit rate that eliminates most I/O after warmup, router interception to identify the selected experts, and co-activation tracking to predict which experts will fire next for speculative prefetch.
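
A minimal sketch of how an expert cache and co-activation tracking could fit together, assuming an LRU eviction policy and simple pairwise counts. Hypura's real cache and predictor are not described in the source, so the `ExpertCache` class, its methods, and the `load_from_nvme` stand-in are hypothetical. The `selected` tuples play the role of the router's per-token expert choices.

```python
from collections import Counter, OrderedDict
from itertools import combinations

class ExpertCache:
    """LRU cache of expert weights plus pairwise co-activation counts."""

    def __init__(self, capacity, load_fn):
        self.capacity = capacity
        self.load_fn = load_fn      # hypothetical loader, e.g. an NVMe read
        self.cache = OrderedDict()  # expert id -> weights, oldest first
        self.coactive = Counter()   # (a, b) with a < b -> times seen together
        self.hits = 0
        self.misses = 0

    def _get(self, expert):
        if expert in self.cache:
            self.hits += 1
            self.cache.move_to_end(expert)  # mark as most recently used
            return self.cache[expert]
        self.misses += 1
        weights = self.load_fn(expert)
        self.cache[expert] = weights
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict least recently used
        return weights

    def fetch(self, selected):
        """Return weights for the experts the router selected for one token."""
        for pair in combinations(sorted(selected), 2):
            self.coactive[pair] += 1
        return [self._get(e) for e in selected]

    def predict(self, expert, k=1):
        """Experts most often seen alongside `expert` (prefetch candidates)."""
        scores = Counter()
        for (a, b), n in self.coactive.items():
            if a == expert:
                scores[b] += n
            elif b == expert:
                scores[a] += n
        return [e for e, _ in scores.most_common(k)]

loads = []

def load_from_nvme(expert):
    # Stand-in for a real read; records which experts were loaded.
    loads.append(expert)
    return f"weights-{expert}"

cache = ExpertCache(capacity=4, load_fn=load_from_nvme)
for selected in [(0, 3), (0, 3), (1, 3), (0, 3)]:
    cache.fetch(selected)

print("hits:", cache.hits, "misses:", cache.misses)
print("likely to fire with expert 3:", cache.predict(3))
```

Once the working set of experts is resident, repeated selections hit the cache, which is how a high hit rate removes most I/O after warmup.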

For dense models like Llama 70B, it uses dense FFN-streaming: attention + norms stay on GPU (~8 GB) while FFN tensors (~32 GB) stream from NVMe through a dynamically-sized pool buffer with scaled prefetch lookahead.
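
The streaming pattern can be sketched as a bounded window of in-flight reads. This is an illustration under assumptions, not Hypura's code: `open_uncached` and `stream_layers` are hypothetical helpers, the fixed `lookahead` stands in for the dynamically scaled prefetch depth, and the window of pending reads stands in for the pool buffer. `fcntl.F_NOCACHE` is only defined on macOS, so the sketch skips it elsewhere.

```python
import fcntl
import os
import tempfile
from collections import deque
from concurrent.futures import ThreadPoolExecutor

def open_uncached(path):
    """Open for reading; on macOS, ask the kernel to bypass the page cache."""
    fd = os.open(path, os.O_RDONLY)
    nocache = getattr(fcntl, "F_NOCACHE", None)  # absent outside macOS
    if nocache is not None:
        fcntl.fcntl(fd, nocache, 1)
    return fd

def stream_layers(fd, layers, lookahead=2):
    """Yield each layer's bytes in order, keeping later reads in flight.

    `layers` is a list of (offset, length) pairs. At most lookahead + 1
    reads are pending at once, which bounds the buffer memory in use.
    """
    with ThreadPoolExecutor(max_workers=max(1, lookahead)) as pool:
        pending = deque()
        next_idx = 0
        for _ in layers:
            # Top up the window: the current layer plus `lookahead` ahead.
            while next_idx < len(layers) and len(pending) <= lookahead:
                offset, length = layers[next_idx]
                # os.pread may return fewer bytes at end of file; a real
                # reader would loop until `length` bytes have arrived.
                pending.append(pool.submit(os.pread, fd, length, offset))
                next_idx += 1
            yield pending.popleft().result()

# Tiny stand-in for a weights file: three 4-byte "layers".
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"A" * 4 + b"B" * 4 + b"C" * 4)
    path = f.name

fd = open_uncached(path)
chunks = list(stream_layers(fd, [(0, 4), (4, 4), (8, 4)], lookahead=1))
os.close(fd)
os.remove(path)
print(chunks)
```

While the consumer works on one layer, the reads for the next layers are already running, so compute and I/O overlap instead of alternating.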

Performance benchmarks

All benchmarks were run on an M1 Max with 32 GB of unified memory and ~5.1 GB/s NVMe sequential read:

  • Qwen 2.5 14B Q4_K_M (8.4 GB): Full-resident mode, 21 tok/s (same as llama.cpp)
  • Mixtral 8x7B Q5_K_M (30.9 GB): Expert-streaming mode, 2.2 tok/s (llama.cpp runs out of memory)
  • Llama 3.3 70B Q4_K_M (39.6 GB): Dense-FFN-streaming mode, 0.3 tok/s (llama.cpp runs out of memory)
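
A back-of-envelope check on the streaming numbers, assuming NVMe reads are the bottleneck and the quoted ~5.1 GB/s is the ceiling. The helper names are made up for this sketch, and the source does not say how many bytes are actually read per token, so the results are bounds rather than measurements.

```python
def max_tokens_per_second(streamed_gb_per_token, nvme_gb_per_s):
    """Upper bound on throughput if every token must read this much from NVMe."""
    return nvme_gb_per_s / streamed_gb_per_token

def max_streamed_gb_per_token(tokens_per_s, nvme_gb_per_s):
    """Most data that can be read per token at a given throughput."""
    return nvme_gb_per_s / tokens_per_s

NVME_GB_PER_S = 5.1  # sequential read figure quoted in the article

# If all ~32 GB of Llama 70B FFN weights were read for every token:
bound_full_ffn = max_tokens_per_second(32.0, NVME_GB_PER_S)

# Read budget per token implied by the reported throughputs:
budget_llama = max_streamed_gb_per_token(0.3, NVME_GB_PER_S)
budget_mixtral = max_streamed_gb_per_token(2.2, NVME_GB_PER_S)

print(f"full FFN streaming bound: {bound_full_ffn:.2f} tok/s")
print(f"Llama 70B at 0.3 tok/s: at most {budget_llama:.1f} GB read per token")
print(f"Mixtral at 2.2 tok/s: at most {budget_mixtral:.1f} GB read per token")
```

Under these assumptions, reading the full ~32 GB per token would cap Llama 70B at about 0.16 tok/s, so the reported 0.3 tok/s implies that no more than roughly 17 GB comes from NVMe per token, with the remainder presumably served from memory.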

Pool buffer size, prefetch depth, and memory budgets are computed automatically from your hardware profile — no manual tuning needed.

Installation

Hypura builds from source with Cargo. You'll need Rust 1.75+ and CMake.

📖 Read the full source: HN AI Agents
