FairyFuse Achieves 29.6x Kernel Speedup on CPUs via Ternary Weight Multiplication-Free Inference

✍️ OpenClawRadar | 📅 Published: May 13, 2026

FairyFuse is an inference system for LLMs with ternary weights (values in {-1, 0, +1}) running on commodity CPUs. It fuses the eight real-valued sub-GEMVs of each widely-linear layer into a single AVX-512 loop built from masked additions and subtractions, which eliminates all floating-point multiplications. Roofline analysis shows that 16x weight compression shifts the memory-bound GEMV toward the compute regime on bandwidth-limited CPUs, yielding a 29.6x kernel speedup over conventional dequantize-and-multiply kernels. Notably, the approach offers little benefit on GPUs.
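To make the multiplication-free idea concrete, here is a minimal scalar sketch in Python. It is not FairyFuse's AVX-512 kernel: the function names and the row-by-row layout are assumptions made for illustration, and the fusion of eight sub-GEMVs is not modeled. It only shows why a ternary weight needs no multiply: +1 adds the input, -1 subtracts it, and 0 skips it.

```python
def ternary_gemv(weights, x):
    """Compute y = W @ x for W with entries in {-1, 0, +1} using only adds and subtracts.

    Hypothetical helper for illustration; a real kernel would apply the same
    selection with SIMD masks instead of per-element branches.
    """
    y = []
    for row in weights:
        if len(row) != len(x):
            raise ValueError("row length must match input length")
        acc = 0.0
        for w, xi in zip(row, x):
            if w == 1:       # "plus" lane: accumulate the input
                acc += xi
            elif w == -1:    # "minus" lane: subtract the input
                acc -= xi
            elif w != 0:
                raise ValueError("weights must be ternary")
            # w == 0: lane is masked off, nothing to do
        y.append(acc)
    return y


def reference_gemv(weights, x):
    """Conventional multiply-accumulate version, for comparison only."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


W = [[1, 0, -1, 1],
     [0, -1, -1, 0],
     [1, 1, 1, -1]]
x = [0.5, 2.0, -1.5, 4.0]

print(ternary_gemv(W, x))  # [6.0, -0.5, -3.0]
```

The add/subtract version and the multiply-based reference agree on this input, which is the property a masked SIMD kernel relies on.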

Key Results

End-to-end throughput: 32.4 tokens per second on a single Intel Xeon 8558P.
Comparison to llama.cpp Q4_K_M: 1.24x faster, with near-lossless quality (WikiText-2 perplexity 5.52 vs. 5.47 for FP16; downstream accuracy 66.0% vs. 66.0% for FP16).
Weight compression: 16x (2 bits per weight) from the ternary representation, with no dequantization to floating point needed.
Technique: fuses eight sub-GEMVs into a single AVX-512 loop using masked adds and subtracts, with no floating-point multiplications at all.
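The 2-bits-per-weight figure can be illustrated with a small packing sketch. The encoding below (00 for 0, 01 for +1, 10 for -1, four weights per byte) is an assumption chosen for clarity, not FairyFuse's documented layout, and all function names are hypothetical. The dot product reads the packed codes directly and uses only adds and subtracts, so no dequantization to floating-point weights takes place.

```python
# Hypothetical 2-bit encoding (an assumption, not FairyFuse's actual layout):
# 0b00 -> 0, 0b01 -> +1, 0b10 -> -1; 0b11 is unused.
ENCODE = {0: 0b00, 1: 0b01, -1: 0b10}
DECODE = {0b00: 0, 0b01: 1, 0b10: -1}


def pack_ternary(weights):
    """Pack ternary weights four per byte, lowest bits first."""
    packed = bytearray((len(weights) + 3) // 4)
    for i, w in enumerate(weights):
        packed[i // 4] |= ENCODE[w] << (2 * (i % 4))
    return bytes(packed)


def unpack_ternary(packed, count):
    """Recover the first `count` ternary weights from packed bytes."""
    return [DECODE[(packed[i // 4] >> (2 * (i % 4))) & 0b11] for i in range(count)]


def packed_dot(packed, x):
    """Dot product read straight from packed codes, using only adds and subtracts."""
    acc = 0.0
    for i, xi in enumerate(x):
        code = (packed[i // 4] >> (2 * (i % 4))) & 0b11
        if code == 0b01:     # +1: add the input
            acc += xi
        elif code == 0b10:   # -1: subtract the input
            acc -= xi
        # 0b00: zero weight, skip
    return acc


weights = [1, 0, -1, 1, 0, -1, -1, 0]
inputs = [0.5, 2.0, -1.5, 4.0, 1.0, 1.0, 2.0, 3.0]
packed = pack_ternary(weights)

print(len(packed))                 # 2 bytes hold 8 weights
print(packed_dot(packed, inputs))  # 3.0
```

Eight weights fit in 2 bytes here. If the baseline is taken to be 32 bits per weight (an assumption; the summary does not name the baseline), the same eight weights would occupy 32 bytes, which matches the 16x ratio quoted above.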

Context

Prior work (Fairy2i) showed that ternary LLMs can match FP16 quality, but existing inference runtimes did not exploit the ternary structure. FairyFuse closes that gap by rearchitecting inference to be multiplication-free on x86 CPUs with AVX-512.

📖 Read the full source: HN LLM Tools
