Krasis LLM Runtime Shows 8.9x Prefill and 4.7x Decode Speed Improvements Over Llama.cpp

Performance Benchmarks
Krasis demonstrates significant performance improvements over llama.cpp when running on equivalent hardware. On a single 5090 GPU limited by PCIE 4.0, Krasis shows:
- 8.9x faster prefill speed
- 4.7x faster decode speed
Specific benchmark results for Qwen3-Coder-Next show Krasis running on a single 16GB 5080 GPU achieving:
- 1801 tokens/sec prefill
- 26.8 tokens/sec decode
This outperforms llama.cpp running on a 32GB 5090 GPU with layer offloading.
Architecture Changes
The latest version of Krasis has dropped the dual-format system and now runs both prefill and decode entirely on GPU with different optimization strategies for each phase. This architectural change results in:
- Reduced CPU requirements
- Less dependency on system RAM memory speed
- Lower overall system RAM usage (now needs only enough for the quantized model plus some overhead, compared to the prior 2.5x model requirement)
Supported Models and Performance
Current supported models with their performance on a single 5090 GPU (PCIE 4.0) are:
- Qwen3.5-35B-A3B: 4475 prefill, 109.1 decode
- Qwen3-Coder-Next: 3560 prefill, 70.3 decode
- Qwen3.5-122B-A10B: 2897 prefill, 27.7 decode
- Qwen3-235B-A22B: 2124 prefill, 9.3 decode
Future Development Plans
The developer plans to:
- Add support for Nvidia Nemotron models, specifically targeting Nemotron Super for consumer GPUs like the 5080
- Potentially support larger Nemotron models when released
- Expand IDE and tooling support for Opencode and Aider
Current Features
Krasis currently offers:
- OpenAI-compatible server
- Single-line installation
- Availability on GitHub
📖 Read the full source: r/LocalLLaMA
👀 See Also

Codev: AI agent workflow for 106 PRs in 14 days
Codev is an open-source system that coordinates multiple AI agents through a strict Spec→Plan→Implement→Review→PR workflow, catching 20 bugs before shipping and producing code rated 1.2 points better on a 10-point scale.

Claude Command Center v5.0.0 Adds Day-One Support for Fable 5 with Mid-Session Switching
Claude Command Center v5.0.0 adds first-class support for Anthropic's new Fable 5 tier, including mid-session model switching, a redesigned model picker, and a fix for versioned alias CLI errors.

Open-source Agent OS: Rust-based OS for AI agents with WASM sandboxing and Hands feature
An open-source operating system for AI agents has been released with 137k lines of Rust code under MIT license. The system runs agents in WASM sandboxes with 16 security layers and introduces 'Hands' for scheduled, autonomous agent operation.

Manifest Router Adds ZAI Subscription Support for OpenClaw Model Management
Manifest router now supports ZAI subscriptions, allowing all ZAI models to appear in routing tiers with automatic model selection per request. The tool is in beta, free, open source, and includes a dashboard for tracking costs per agent, message, and model.