FairyFuse Achieves 29.6x Kernel Speedup on CPUs via Ternary Weight Multiplication-Free Inference
FairyFuse is an inference system for ternary LLMs (weights in {-1, 0, +1}) on commodity CPUs. It fuses the eight real-valued sub-GEMVs of each widely-linear layer into a single AVX-512 loop built from masked additions and subtractions, which eliminates all floating-point multiplications. Roofline analysis shows that 16x weight compression shifts the memory-bound GEMV toward the compute regime on bandwidth-limited CPUs, yielding a 29.6x kernel speedup over conventional dequantize-and-multiply kernels. The same approach offers little benefit on GPUs.
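The roofline argument can be made concrete with a little arithmetic. The sketch below is illustrative only: the peak-compute and bandwidth figures are hypothetical round numbers, not measurements of any real CPU, and the model counts one add or subtract per weight while ignoring activation traffic. It assumes a 32-bit baseline, which is what a 16x ratio at 2 bits per weight implies.

```python
def arithmetic_intensity(bits_per_weight, ops_per_weight=1.0):
    """Operations performed per byte of weight data streamed from memory."""
    bytes_per_weight = bits_per_weight / 8
    return ops_per_weight / bytes_per_weight

def attainable_throughput(peak_gops, bandwidth_gbs, intensity):
    """Classic roofline: the lower of the compute roof and the bandwidth roof."""
    return min(peak_gops, bandwidth_gbs * intensity)

# Hypothetical machine (assumed values, chosen only for easy arithmetic).
PEAK_GOPS = 1000.0      # compute roof, Gop/s
BANDWIDTH_GBS = 100.0   # memory bandwidth, GB/s

baseline = arithmetic_intensity(bits_per_weight=32)  # 0.25 ops/byte
ternary = arithmetic_intensity(bits_per_weight=2)    # 4.0 ops/byte

baseline_roof = attainable_throughput(PEAK_GOPS, BANDWIDTH_GBS, baseline)
ternary_roof = attainable_throughput(PEAK_GOPS, BANDWIDTH_GBS, ternary)

print(f"intensity: {baseline} -> {ternary} ops/byte")
print(f"bandwidth-bound ceiling: {baseline_roof} -> {ternary_roof} Gop/s")
```

In this toy model the ceiling rises by the same 16x as the compression ratio, because the kernel stays on the bandwidth-limited side of the roofline. A machine with a much higher bandwidth-to-compute ratio would hit the compute roof sooner and gain less.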
Key Results
- End-to-end throughput: 32.4 tokens per second on a single Intel Xeon 8558P.
- Comparison to llama.cpp Q4_K_M: 1.24x faster with near-lossless quality (WikiText-2 perplexity 5.52 vs. 5.47 for FP16; downstream accuracy 66.0% vs. 66.0% for FP16).
- Weight compression: 16x (2 bits per weight) from the ternary representation, with no dequantization to floating point needed.
- Technique: fuses eight sub-GEMVs into a single AVX-512 loop using masked adds and subtracts, with no floating-point multiplications at all.
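To illustrate the multiplication-free idea, here is a minimal scalar sketch in pure Python. It is not FairyFuse's kernel: the 2-bit encoding, the function names, and the row-major layout are all assumptions made for illustration, and the fusion of eight sub-GEMVs and the AVX-512 vectorization are not shown. The branch on each 2-bit code plays the role that a masked add or masked subtract would play in a SIMD loop.

```python
# Assumed 2-bit encoding (hypothetical, not the format used by FairyFuse).
ENCODE = {0: 0b00, 1: 0b01, -1: 0b10}

def pack_row(weights):
    """Pack ternary weights into bytes, four 2-bit codes per byte."""
    packed = bytearray((len(weights) + 3) // 4)
    for i, w in enumerate(weights):
        packed[i // 4] |= ENCODE[w] << (2 * (i % 4))
    return bytes(packed)

def ternary_dot(packed, x):
    """Dot product of one packed ternary row with x, using only add/subtract."""
    acc = 0.0
    for i, xi in enumerate(x):
        code = (packed[i // 4] >> (2 * (i % 4))) & 0b11
        if code == 0b01:      # weight +1: add the activation
            acc += xi
        elif code == 0b10:    # weight -1: subtract the activation
            acc -= xi
        # weight 0: skip the activation entirely
    return acc

def ternary_gemv(packed_rows, x):
    """Matrix-vector product over packed ternary rows."""
    return [ternary_dot(row, x) for row in packed_rows]

W = [[1, 0, -1, 1, 0],
     [-1, -1, 0, 0, 1]]
x = [0.5, 2.0, -1.5, 3.0, 4.0]

y = ternary_gemv([pack_row(row) for row in W], x)
# Reference result computed the conventional way, with multiplications.
reference = [sum(w * xi for w, xi in zip(row, x)) for row in W]
print(y, reference)
```

Both paths produce the same vector here, while the packed path never multiplies and stores each weight in 2 bits.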
Context
Prior work (Fairy2i) showed that ternary LLMs can match FP16 quality, but existing runtimes did not exploit the ternary structure. FairyFuse closes that gap by rearchitecting inference to be multiplication-free on x86 CPUs with AVX-512.
📖 Read the full source: HN LLM Tools