Krasis: Hybrid CPU/GPU Runtime for Large MoE Models Achieves 3,324 tok/s Prefill on RTX 5080

Krasis is a hybrid CPU/GPU runtime specifically designed for large Mixture-of-Experts (MoE) models. The core approach uses GPU for the computationally expensive prefill phase while CPU handles decode, with system RAM providing additional capacity to maximize performance.
Benchmark Results
RTX 5080 Configuration:
- Hardware: AMD 5900X, DDR4-3200, 1x RTX 5080 16GB, PCIe 4.0 x16
- Qwen3-Coder-Next (80B) Q4: 3,324 tok/s prefill, 9.7s TTFT (35K context), 14.9 tok/s decode
EPYC Configuration:
- Hardware: AMD EPYC 7742 (64c), DDR4-2666 8-channel, 1x RTX 2000 Ada 16GB, PCIe 4.0 x8
- Qwen3-Coder-Next (80B) Q4: 1,060 tok/s prefill, 18.9s TTFT, 15.8 tok/s decode
- Qwen3-Coder-Next (80B) Q8: 873 tok/s prefill, 40.1s TTFT, 12.4 tok/s decode
- Qwen3.5-35B-A3B Q4: 1,374 tok/s prefill, 14.6s TTFT, 15.0 tok/s decode
- Qwen3-235B-A22B Q4: 289 tok/s prefill, 69.1s TTFT, 3.4 tok/s decode
- DeepSeek V2-Lite (16B) Q4: 1,477 tok/s prefill, 13.6s TTFT, 20.2 tok/s decode
- DeepSeek V2-Lite (16B) Q8: 1,317 tok/s prefill, 15.2s TTFT, 17.8 tok/s decode
Benchmarks used 10K–50K token prompts for prefill (best of 20K/35K/50K reported) and 64-token generation for decode (average of 3 runs).
How It Works
Unlike standard runtimes that offload only a few layers to GPU and run most of the model on CPU, Krasis treats the GPU as a streaming compute engine. It pushes the model through VRAM as fast as possible, hiding transfers under concurrent compute. The GPU handles the full prefill pass, then the CPU handles decode.
Tradeoffs
- RAM hungry: Requires ~2.5x the quantized model weight in system RAM (e.g., ~100GB for Qwen3-Coder-Next at Q4)
- NVIDIA cards only
- Specifically targeted at MoE models (decode would be slow on dense models)
- First run is slow due to preprocessing and caching
- Disk hungry: Requires original BF16 safetensors file and stores cached transcoded models (~2x quantized model size)
Supported Models
Qwen3-Coder-Next (most thoroughly tested), Qwen3.5-35B-A3B, Qwen3-235B-A22B, and DeepSeek V2-Lite. Other models coming soon.
Technical Details
- Written in Rust + Python (for orchestration)
- OpenAI-compatible API (works with Cursor, OpenCode, etc.)
- Interactive launcher for configuration
- SSPL licensed (free to use, modify, distribute)
- GitHub: https://github.com/brontoguana/krasis
The developer is seeking feedback on which models to support next, thoughts on the tradeoffs, and benchmarks from users with 5-series cards and PCIe 5.0.
📖 Read the full source: r/LocalLLaMA
👀 See Also

Codesight: AI Context Engine Cuts 30K-60K Tokens from Claude Code Sessions
Codesight is an open-source tool that analyzes codebases to provide AI coding agents with structured context, reducing token waste. A developer collaborated with the maintainer to add AST parsing for Next.js and Prisma, an eval suite, token telemetry, and profiles for Claude Code and Cursor.

OpenClaw vs Hermes: Choose the Right Self-Hosted AI Agent After 100+ Deployments
After deploying 100+ AI agents for clients, a Reddit user shares hard-won lessons: OpenClaw (149K stars) is the reliable workhorse for single/small fleets; Hermes excels at multi-agent orchestration but has a smaller community.

OpenClaw Implements Agent History Compression to Reduce Context Usage
OpenClaw now compresses agent history by replacing completed subtask logs with structured summaries, reducing ~1M tokens to ~30K. The system uses a 4-pass scanner to identify task lifecycles and generates masked summaries that maintain agent compatibility.

Manifest Now Supports Claude Pro/Max Subscriptions Without API Key
Manifest, an open source routing layer for OpenClaw, now allows direct connection of Claude Pro or Max subscriptions without requiring an API key. Users with API keys can configure fallback routing when subscription rate limits are hit.