FairyFuse: Multiplication-Free LLM Inference on CPUs via Fused Ternary Kernels
Fei Zuo, Xiaoyan Xi, Quanyi Zeng, Feiyu Wang, Ho Fai Leung

TL;DR
FairyFuse is a CPU inference system for large language models that eliminates floating-point multiplications through fused ternary kernels, raising inference speed while keeping quality near-lossless.
Contribution
The paper presents a fused ternary kernel that enables multiplication-free LLM inference on CPUs, running faster than existing systems with minimal quality loss.
Findings
Achieves 29.6x kernel speedup on CPU with ternary weights.
End-to-end inference reaches 32.4 tokens/sec on a single CPU.
Maintains near-lossless quality comparable to FP16 models.
Abstract
Large language models are increasingly deployed on CPU-only platforms where memory bandwidth is the primary bottleneck for autoregressive generation. Weight quantization to four bits or below reduces memory pressure, yet existing systems still dequantize weights and perform floating-point multiplications, limiting the achievable gains. Ternary weights in {-1, 0, +1} provide a more efficient alternative, replacing multiplications with conditional additions, subtractions, or no-ops. While Fairy2i shows that ternary LLMs can match FP16 quality, its runtime does not exploit this structure. We present FairyFuse, an inference system that enables multiplication-free execution on commodity CPUs by fusing the eight real-valued sub-GEMVs of each widely-linear layer into a single AVX-512 loop using masked additions and subtractions, with zero floating-point multiplications. Roofline analysis shows…
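To illustrate the abstract's central claim, that ternary weights turn each multiply into a conditional addition, subtraction, or no-op, here is a minimal scalar sketch in Python. It is not the FairyFuse kernel: it does not model the fused AVX-512 loop, the masked vector instructions, or the widely-linear layer structure, and the names `ternary_gemv` and `dense_gemv` are hypothetical helpers introduced only for this example.

```python
def ternary_gemv(weights, x):
    """Compute y = W @ x for W with entries in {-1, 0, +1}.

    Illustrative only: uses additions and subtractions, never a multiply.
    """
    y = []
    for row in weights:
        acc = 0.0
        for w, xi in zip(row, x):
            if w == 1:
                acc += xi      # +1 weight: add the activation
            elif w == -1:
                acc -= xi      # -1 weight: subtract the activation
            # 0 weight: no-op, the activation is skipped
        y.append(acc)
    return y


def dense_gemv(weights, x):
    """Reference GEMV using ordinary multiplications, for comparison."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


# Small hand-picked example (values are arbitrary, not from the paper).
W = [[1, 0, -1, 1],
     [0, -1, -1, 0],
     [1, 1, 1, -1]]
x = [0.5, -2.0, 3.0, 1.5]

y = ternary_gemv(W, x)
print(y)  # [-1.0, -1.0, 0.0]
```

The multiplication-free result matches the dense reference on this input. A vectorized kernel would presumably express the same three-way branch as per-lane masks over a whole register, which is the role the abstract assigns to masked additions and subtractions.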
