Krasis LLM Runtime Shows 8.9x Prefill and 4.7x Decode Speed Improvements Over Llama.cpp

✍️ OpenClawRadar📅 Published: March 17, 2026🔗 Source

Performance Benchmarks

Krasis demonstrates significant performance improvements over llama.cpp when running on equivalent hardware. On a single 5090 GPU limited by PCIE 4.0, Krasis shows:

8.9x faster prefill speed
4.7x faster decode speed

Specific benchmark results for Qwen3-Coder-Next show Krasis running on a single 16GB 5080 GPU achieving:

1801 tokens/sec prefill
26.8 tokens/sec decode

This outperforms llama.cpp running on a 32GB 5090 GPU with layer offloading.

Architecture Changes

The latest version of Krasis has dropped the dual-format system and now runs both prefill and decode entirely on GPU with different optimization strategies for each phase. This architectural change results in:

Reduced CPU requirements
Less dependency on system RAM memory speed
Lower overall system RAM usage (now needs only enough for the quantized model plus some overhead, compared to the prior 2.5x model requirement)

Supported Models and Performance

Current supported models with their performance on a single 5090 GPU (PCIE 4.0) are:

Qwen3.5-35B-A3B: 4475 prefill, 109.1 decode
Qwen3-Coder-Next: 3560 prefill, 70.3 decode
Qwen3.5-122B-A10B: 2897 prefill, 27.7 decode
Qwen3-235B-A22B: 2124 prefill, 9.3 decode

Future Development Plans

The developer plans to:

Add support for Nvidia Nemotron models, specifically targeting Nemotron Super for consumer GPUs like the 5080
Potentially support larger Nemotron models when released
Expand IDE and tooling support for Opencode and Aider

Current Features

Krasis currently offers:

OpenAI-compatible server
Single-line installation
Availability on GitHub

📖 Read the full source: r/LocalLLaMA

👀 See Also

Tools

Codev: AI agent workflow for 106 PRs in 14 days

Codev is an open-source system that coordinates multiple AI agents through a strict Spec→Plan→Implement→Review→PR workflow, catching 20 bugs before shipping and producing code rated 1.2 points better on a 10-point scale.

Mar 1, 2026, 09:45 PM UTC

OpenClawRadar

Tools

Claude Command Center v5.0.0 Adds Day-One Support for Fable 5 with Mid-Session Switching

Claude Command Center v5.0.0 adds first-class support for Anthropic's new Fable 5 tier, including mid-session model switching, a redesigned model picker, and a fix for versioned alias CLI errors.

Jun 11, 2026, 12:16 PM UTC

OpenClawRadar

Tools

Open-source Agent OS: Rust-based OS for AI agents with WASM sandboxing and Hands feature

An open-source operating system for AI agents has been released with 137k lines of Rust code under MIT license. The system runs agents in WASM sandboxes with 16 security layers and introduces 'Hands' for scheduled, autonomous agent operation.

Feb 27, 2026, 03:45 AM UTC

OpenClawRadar

Tools

Manifest Router Adds ZAI Subscription Support for OpenClaw Model Management

Manifest router now supports ZAI subscriptions, allowing all ZAI models to appear in routing tiers with automatic model selection per request. The tool is in beta, free, open source, and includes a dashboard for tracking costs per agent, message, and model.

Apr 16, 2026, 04:45 PM UTC

OpenClawRadar