FairyFuse Achieves 29.6x Kernel Speedup on CPUs via Ternary Weight Multiplication-Free Inference

✍️ OpenClawRadar · 📅 Published: May 13, 2026

FairyFuse is an inference system for ternary LLMs, whose weights take values in {-1, 0, +1}, on commodity CPUs. It fuses the eight real-valued sub-GEMVs of each widely-linear layer into a single AVX-512 loop built from masked additions and subtractions, which eliminates all floating-point multiplications. Roofline analysis shows that 16x weight compression shifts the memory-bound GEMV toward the compute-bound regime on bandwidth-limited CPUs, yielding a 29.6x kernel speedup over conventional dequantize-and-multiply kernels. Notably, the approach offers little benefit on GPUs.
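
The roofline argument can be illustrated with a rough back-of-the-envelope sketch. This is not taken from the FairyFuse paper: the function names, the one-operation-per-weight cost model, and the 32-bit baseline (the width that 2 bits per weight and a 16x ratio imply) are all assumptions made for illustration, and any peak or bandwidth figures passed in are hypothetical.

```python
def gemv_intensity(rows, cols, bits_per_weight, act_bytes=4):
    """Operations per byte of memory traffic for y = W @ x.

    Assumed cost model: one add-like operation per weight, weights read
    once, input and output vectors stored at act_bytes per element.
    """
    ops = rows * cols
    weight_bytes = rows * cols * bits_per_weight / 8
    vector_bytes = (rows + cols) * act_bytes
    return ops / (weight_bytes + vector_bytes)


def attainable(intensity, peak_ops, bandwidth):
    """Classic roofline bound: min(compute peak, bandwidth * intensity)."""
    return min(peak_ops, bandwidth * intensity)


# Same layer shape, two weight formats (the 4096 x 4096 shape is hypothetical).
baseline = gemv_intensity(4096, 4096, bits_per_weight=32)
ternary = gemv_intensity(4096, 4096, bits_per_weight=2)
gain = ternary / baseline  # approaches 16 as the matrix grows
```

Under this model the kernel moves far fewer bytes per operation once weights shrink to 2 bits, so a bandwidth-limited CPU sits closer to its compute ceiling. How close depends on the machine's actual peak and bandwidth, which this sketch does not attempt to model.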

Key Results

  • End-to-end throughput: 32.4 tokens per second on a single Intel Xeon 8558P.
  • Comparison to llama.cpp Q4_K_M: 1.24x faster, with near-lossless quality relative to FP16 (WikiText-2 perplexity 5.52 vs. 5.47 for FP16; downstream accuracy 66.0% vs. 66.0% for FP16).
  • Weight compression: 16x (2 bits per weight) from the ternary representation, with no dequantization to floating point.
  • Technique: fuses eight sub-GEMVs into a single AVX-512 loop of masked adds and subtracts, with no floating-point multiplications.
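
To make the masked add/subtract idea concrete, here is a minimal scalar sketch of a multiplication-free ternary GEMV. It is not the FairyFuse kernel: the function names and the two-bitmask encoding (one "plus" mask and one "minus" mask per row, which is one way to spend 2 bits per weight) are assumptions for illustration. It also leaves out the fusion of eight sub-GEMVs and the AVX-512 vectorization.

```python
def pack_ternary_row(row):
    """Encode a row of weights in {-1, 0, +1} as two integer bitmasks.

    Bit j of `plus` is set where the weight is +1, bit j of `minus`
    where it is -1; a zero weight sets neither bit.
    """
    plus = minus = 0
    for j, w in enumerate(row):
        if w == 1:
            plus |= 1 << j
        elif w == -1:
            minus |= 1 << j
        elif w != 0:
            raise ValueError(f"weight {w!r} is not ternary")
    return plus, minus


def ternary_gemv(packed_rows, x):
    """Compute y = W @ x using only additions and subtractions."""
    y = []
    for plus, minus in packed_rows:
        acc = 0.0
        for j, xj in enumerate(x):
            if (plus >> j) & 1:
                acc += xj  # weight +1: add the activation
            elif (minus >> j) & 1:
                acc -= xj  # weight -1: subtract the activation
            # weight 0: skip the activation entirely
        y.append(acc)
    return y


W = [[1, 0, -1], [-1, -1, 1]]
x = [2.0, 3.0, 5.0]
y = ternary_gemv([pack_ternary_row(r) for r in W], x)
```

In a vectorized kernel the per-element branches would presumably become lane masks. AVX-512 provides masked add and subtract intrinsics such as `_mm512_mask_add_ps` and `_mm512_mask_sub_ps`, although whether FairyFuse uses exactly these is not stated here.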

Context

Prior work (Fairy2i) showed that ternary LLMs can match FP16 quality, but the inference runtime did not exploit the ternary structure. FairyFuse closes that gap by rearchitecting inference to be multiplication-free on x86 CPUs with AVX-512.

📖 Read the full source: HN LLM Tools
