Bitnet.cpp: Efficient Edge Inference for Ternary LLMs
Jinheng Wang, Hansong Zhou, Ting Song, Shijie Cao, Yan Xia, Ting Cao, Jianyu Wei, Shuming Ma, Hongyu Wang, Furu Wei

TL;DR
Bitnet.cpp is an inference system optimized for ternary LLMs. Its mpGEMM (mixed-precision matrix multiplication) library enables lossless inference at under 2 bits per weight and runs up to 6.25x faster than full-precision baselines on edge devices.
Contribution
We introduce Bitnet.cpp, an inference system with a specialized mpGEMM library for efficient, lossless ternary LLM inference on edge devices, an area where research and practical tooling remain scarce.
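To make the mpGEMM idea concrete, the following is a minimal sketch of a ternary matrix-vector product, assuming weights restricted to {-1, 0, 1}, integer activations, and a single floating-point scale. The function name `ternary_gemv` and its parameters are hypothetical and are not taken from the Bitnet.cpp library; the point is only that ternary weights reduce each multiply-accumulate to an add, a subtract, or a skip.

```python
def ternary_gemv(weights, activations, scale=1.0):
    """Multiply a ternary matrix by an integer vector using only add/subtract.

    weights: list of rows, each entry in {-1, 0, 1} (assumed, not validated by type)
    activations: list of ints (for example int8-quantized activations)
    scale: one float applied to every output (hypothetical per-tensor scale)
    """
    out = []
    for row in weights:
        if len(row) != len(activations):
            raise ValueError("row length must match activation length")
        acc = 0
        for w, a in zip(row, activations):
            if w == 1:
                acc += a      # +1 weight: add the activation
            elif w == -1:
                acc -= a      # -1 weight: subtract the activation
            elif w != 0:
                raise ValueError("weights must be ternary")
            # 0 weight: contributes nothing, so it is skipped
        out.append(acc * scale)
    return out


W = [[1, 0, -1, 1],
     [-1, -1, 0, 1]]
x = [12, -7, 5, 3]
y = ternary_gemv(W, x, scale=0.5)
```

Because the accumulation stays in integer arithmetic until the final scaling step, no rounding error enters the inner loop. An optimized kernel would vectorize this loop rather than branch per element.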
Findings
Achieves up to 6.25x speedup over full-precision baselines.
Achieves up to 2.32x speedup over low-bit baselines.
Sets new benchmarks in edge inference efficiency.
Abstract
The advent of 1-bit large language models (LLMs), led by BitNet b1.58, has spurred interest in ternary LLMs. Despite this, research and practical applications focusing on efficient edge inference for ternary LLMs remain scarce. To bridge this gap, we introduce Bitnet.cpp, an inference system optimized for BitNet b1.58 and ternary LLMs. Given that mixed-precision matrix multiplication (mpGEMM) constitutes the bulk of inference time in ternary LLMs, Bitnet.cpp incorporates a novel mpGEMM library to facilitate sub-2-bits-per-weight, efficient and lossless inference. The library features two core solutions: Ternary Lookup Table (TL), which addresses spatial inefficiencies of previous bit-wise methods, and Int2 with a Scale (I2_S), which ensures lossless edge inference, both enabling high-speed inference. Our experiments show that Bitnet.cpp achieves up to a 6.25x increase in speed over…
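The two storage ideas named in the abstract can be illustrated with a small sketch. It assumes a 2-bit-per-weight packing with one shared scale (in the spirit of I2_S) and a lookup table indexed by groups of three ternary weights (in the spirit of TL, where 27 combinations fit in 5 bits, about 1.67 bits per weight). The bit layouts, group size, and all function names here are assumptions for illustration and are not the library's actual formats or API.

```python
from itertools import product


def pack_i2(weights):
    """Pack ternary weights into bytes, four weights per byte (2 bits each)."""
    if len(weights) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    packed = bytearray()
    for i in range(0, len(weights), 4):
        byte = 0
        for j, w in enumerate(weights[i:i + 4]):
            byte |= (w + 1) << (2 * j)  # map -1, 0, 1 to codes 0, 1, 2
        packed.append(byte)
    return bytes(packed)


def unpack_i2(packed):
    """Invert pack_i2 exactly; the round trip loses no information."""
    return [((byte >> (2 * j)) & 0b11) - 1 for byte in packed for j in range(4)]


def tl_index(w3):
    """Base-3 index (0..26) of a group of three ternary weights."""
    a, b, c = (w + 1 for w in w3)
    return a * 9 + b * 3 + c


def build_table(x3):
    """All 27 partial sums for one group of three activations."""
    return [sum(w * x for w, x in zip(ws, x3))
            for ws in product((-1, 0, 1), repeat=3)]


def tl_dot(weights, activations, scale=1.0):
    """Dot product via table lookup instead of per-element arithmetic."""
    acc = 0
    for i in range(0, len(weights), 3):
        table = build_table(activations[i:i + 3])
        acc += table[tl_index(weights[i:i + 3])]
    return acc * scale


w = [1, 0, -1, -1, 1, 1, 0, 0, -1, 1, -1, 0]
x = [3, -2, 5, 7, 1, -4, 2, 6, -1, 8, 0, 9]
packed = pack_i2(w)
restored = unpack_i2(packed)
result = tl_dot(restored, x, scale=0.25)
```

In this sketch the table is rebuilt for every group to keep the code short. In a matrix multiplication the table for a given activation group would be built once and reused across every weight row, which is where a lookup approach saves work.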
Taxonomy
Topics: Advanced Data Storage Technologies · Advanced Data and IoT Technologies · Advanced Steganography and Watermarking Techniques
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Lib
