Krasis: Hybrid CPU/GPU Runtime for Large MoE Models Achieves 3,324 tok/s Prefill on RTX 5080

OpenClawRadar | Published: February 27, 2026

Krasis is a hybrid CPU/GPU runtime designed for large Mixture-of-Experts (MoE) models. It runs the compute-heavy prefill phase on the GPU and the decode phase on the CPU, and it trades additional system RAM for performance.

Benchmark Results

RTX 5080 Configuration:

  • Hardware: AMD Ryzen 9 5900X, DDR4-3200, 1x RTX 5080 16GB, PCIe 4.0 x16
  • Qwen3-Coder-Next (80B) Q4: 3,324 tok/s prefill, 9.7s TTFT (35K context), 14.9 tok/s decode

EPYC Configuration:

  • Hardware: AMD EPYC 7742 (64c), DDR4-2666 8-channel, 1x RTX 2000 Ada 16GB, PCIe 4.0 x8
  • Qwen3-Coder-Next (80B) Q4: 1,060 tok/s prefill, 18.9s TTFT, 15.8 tok/s decode
  • Qwen3-Coder-Next (80B) Q8: 873 tok/s prefill, 40.1s TTFT, 12.4 tok/s decode
  • Qwen3.5-35B-A3B Q4: 1,374 tok/s prefill, 14.6s TTFT, 15.0 tok/s decode
  • Qwen3-235B-A22B Q4: 289 tok/s prefill, 69.1s TTFT, 3.4 tok/s decode
  • DeepSeek V2-Lite (16B) Q4: 1,477 tok/s prefill, 13.6s TTFT, 20.2 tok/s decode
  • DeepSeek V2-Lite (16B) Q8: 1,317 tok/s prefill, 15.2s TTFT, 17.8 tok/s decode

Benchmarks used 10K–50K token prompts for prefill (best of 20K/35K/50K reported) and 64-token generation for decode (average of 3 runs).
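
To put these figures in terms of a whole request, here is a rough sketch. It assumes the wait before the first token equals the reported TTFT and that decode speed stays at the reported rate, which may not hold for longer outputs. The function name and the 500-token output length are illustrative and not part of the benchmark. The EPYC rows do not state the prompt length behind their TTFT, so the two results are not directly comparable.

```python
def response_time_s(ttft_s: float, decode_tok_s: float, output_tokens: int) -> float:
    """Estimate wall-clock seconds for one request: prefill wait plus decode time."""
    if decode_tok_s <= 0:
        raise ValueError("decode_tok_s must be positive")
    return ttft_s + output_tokens / decode_tok_s


# Reported figures for Qwen3-Coder-Next Q4 (TTFT in seconds, decode in tok/s).
# The 500-token output length is a hypothetical workload.
rtx_5080 = response_time_s(ttft_s=9.7, decode_tok_s=14.9, output_tokens=500)
epyc = response_time_s(ttft_s=18.9, decode_tok_s=15.8, output_tokens=500)
print(f"RTX 5080: {rtx_5080:.1f} s, EPYC: {epyc:.1f} s")
```

Under these assumptions decode time dominates once the output grows past a few hundred tokens, so the large prefill gap between the two machines matters most for long prompts with short answers.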

How It Works

Unlike standard runtimes that offload only a few layers to GPU and run most of the model on CPU, Krasis treats the GPU as a streaming compute engine. It pushes the model through VRAM as fast as possible, hiding transfers under concurrent compute. The GPU handles the full prefill pass, then the CPU handles decode.
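
The idea of hiding transfers under compute can be illustrated with a small timing model. This is a sketch of generic double buffering, not Krasis's actual scheduler: it assumes the model is split into chunks, that chunk i+1 is copied into a second VRAM buffer while chunk i computes, and that the per-chunk times below are made-up numbers.

```python
def sequential_time(transfer_s, compute_s):
    """Copy each chunk, then compute on it, with no overlap."""
    return sum(transfer_s) + sum(compute_s)


def pipelined_time(transfer_s, compute_s):
    """Double-buffered schedule: chunk i+1 is copied while chunk i computes."""
    if len(transfer_s) != len(compute_s):
        raise ValueError("need one transfer time and one compute time per chunk")
    if not transfer_s:
        return 0.0
    total = transfer_s[0]  # the first copy cannot be hidden behind compute
    for i in range(len(compute_s)):
        next_copy = transfer_s[i + 1] if i + 1 < len(transfer_s) else 0.0
        # Each step lasts as long as the slower of compute and the next copy.
        total += max(compute_s[i], next_copy)
    return total


# Hypothetical per-chunk timings in seconds (not measured on any hardware).
transfers = [0.10, 0.10, 0.10, 0.10]
computes = [0.12, 0.12, 0.12, 0.12]
print(f"sequential: {sequential_time(transfers, computes):.2f} s")
print(f"pipelined:  {pipelined_time(transfers, computes):.2f} s")
```

When compute per chunk takes at least as long as the copy, only the first transfer shows up in the total. When the copy is slower, the PCIe link becomes the bottleneck, which may be why the developer asks for PCIe 5.0 results.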

Tradeoffs

  • RAM hungry: Requires ~2.5x the size of the quantized model weights in system RAM (e.g., ~100GB for Qwen3-Coder-Next at Q4)
  • NVIDIA cards only
  • Specifically targeted at MoE models (decode would be slow on dense models)
  • First run is slow due to preprocessing and caching
  • Disk hungry: Requires the original BF16 safetensors files and stores cached transcoded models (~2x the quantized model size)
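
The RAM and disk figures above can be turned into a quick planning estimate. This is a back-of-the-envelope sketch: it assumes quantized size is simply parameters times bits per weight, which ignores quantization scales and any tensors kept at higher precision. The function and key names are illustrative.

```python
BYTES_PER_BF16_PARAM = 2  # BF16 stores each parameter in 16 bits


def estimate_footprint_gb(params_billion: float, quant_bits: int,
                          ram_factor: float = 2.5,
                          cache_factor: float = 2.0) -> dict:
    """Rough RAM and disk needs, using the ~2.5x and ~2x factors quoted above."""
    # Idealized size: ignores quantization metadata and unquantized tensors.
    quantized_gb = params_billion * quant_bits / 8
    return {
        "quantized_gb": quantized_gb,
        "system_ram_gb": quantized_gb * ram_factor,
        "disk_cache_gb": quantized_gb * cache_factor,
        "bf16_source_gb": params_billion * BYTES_PER_BF16_PARAM,
    }


# Qwen3-Coder-Next: 80B parameters at Q4.
est = estimate_footprint_gb(params_billion=80, quant_bits=4)
for name, gigabytes in est.items():
    print(f"{name}: {gigabytes:.0f} GB")
```

For the 80B model at Q4 this gives about 100 GB of system RAM, matching the figure quoted above, plus roughly 80 GB of cache on top of about 160 GB of BF16 source files.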

Supported Models

Qwen3-Coder-Next (most thoroughly tested), Qwen3.5-35B-A3B, Qwen3-235B-A22B, and DeepSeek V2-Lite. Support for other models is planned.

Technical Details

  • Written in Rust, with Python for orchestration
  • OpenAI-compatible API (works with Cursor, OpenCode, etc.)
  • Interactive launcher for configuration
  • SSPL licensed (source-available; free to use, modify, and distribute, with added obligations for anyone offering the software as a service)
  • GitHub: https://github.com/brontoguana/krasis

The developer is seeking feedback on which models to support next, thoughts on the tradeoffs, and benchmarks from users with 5-series cards and PCIe 5.0.

Full source: r/LocalLLaMA
