1-bit AI Infra: Part 1.1, Fast and Lossless BitNet b1.58 Inference on CPUs
Jinheng Wang, Hansong Zhou, Ting Song, Shaoguang Mao, Shuming Ma, Hongyu Wang, Yan Xia, Furu Wei

TL;DR
This paper introduces bitnet.cpp, a software stack for fast, lossless inference of 1-bit (ternary BitNet b1.58) LLMs on CPUs. It reports speedups of 2.37x to 6.17x on x86 CPUs and 1.37x to 5.07x on ARM CPUs, which makes local deployment of large language models more practical.
Contribution
The work develops a set of CPU kernels, packaged as the bitnet.cpp software stack, that support fast and lossless inference of ternary BitNet b1.58 LLMs.
Findings
Speedups of 2.37x to 6.17x on x86 CPUs
Speedups of 1.37x to 5.07x on ARM CPUs
Speedups are reported across various model sizes
Abstract
Recent advances in 1-bit Large Language Models (LLMs), such as BitNet and BitNet b1.58, present a promising approach to enhancing the efficiency of LLMs in terms of speed and energy consumption. These developments also enable local LLM deployment across a broad range of devices. In this work, we introduce bitnet.cpp, a tailored software stack designed to unlock the full potential of 1-bit LLMs. Specifically, we develop a set of kernels to support fast and lossless inference of ternary BitNet b1.58 LLMs on CPUs. Extensive experiments demonstrate that bitnet.cpp achieves significant speedups, ranging from 2.37x to 6.17x on x86 CPUs and from 1.37x to 5.07x on ARM CPUs, across various model sizes. The code is available at https://github.com/microsoft/BitNet.
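As a rough illustration of why ternary weights suit CPU inference, the sketch below packs weights from {-1, 0, +1} into 2 bits each and computes a dot product using only additions and subtractions. It is a minimal sketch under stated assumptions, not the kernels shipped in bitnet.cpp: the function names (`pack_ternary`, `unpack_ternary`, `ternary_dot`), the 2-bit encoding, and the use of integer activations are all hypothetical choices made for this example. "Lossless" is read here in a narrow sense: the packed computation returns exactly the same integer as the unpacked reference.

```python
from typing import List


def pack_ternary(weights: List[int]) -> bytes:
    """Pack ternary weights (-1, 0, +1) four per byte, 2 bits each.

    Hypothetical encoding for illustration: -1 -> 0, 0 -> 1, +1 -> 2.
    """
    if any(w not in (-1, 0, 1) for w in weights):
        raise ValueError("weights must be ternary (-1, 0, or +1)")
    out = bytearray((len(weights) + 3) // 4)
    for i, w in enumerate(weights):
        out[i // 4] |= (w + 1) << (2 * (i % 4))
    return bytes(out)


def unpack_ternary(packed: bytes, n: int) -> List[int]:
    """Recover the first n ternary weights from the packed bytes."""
    return [((packed[i // 4] >> (2 * (i % 4))) & 0b11) - 1 for i in range(n)]


def ternary_dot(packed: bytes, activations: List[int]) -> int:
    """Dot product with packed ternary weights: no multiplications needed."""
    acc = 0
    for i, a in enumerate(activations):
        code = (packed[i // 4] >> (2 * (i % 4))) & 0b11
        if code == 2:      # weight +1: add the activation
            acc += a
        elif code == 0:    # weight -1: subtract the activation
            acc -= a
        # code == 1 is weight 0: skip the activation entirely
    return acc


# Assumed example data: 7 ternary weights and int8-range activations.
weights = [1, -1, 0, 1, 0, -1, 1]
acts = [12, -7, 33, 5, -128, 127, -1]

packed = pack_ternary(weights)
reference = sum(w * a for w, a in zip(weights, acts))

# The packed computation matches the plain integer reference exactly.
assert ternary_dot(packed, acts) == reference
assert unpack_ternary(packed, len(weights)) == weights
```

The seven weights above occupy two bytes instead of seven or more, and the inner loop touches each activation at most once with an add or a subtract. A production kernel would vectorize this and choose its own storage layout; see the linked repository for the actual implementation.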
Taxonomy
Topics: Brain Tumor Detection and Classification · Neural Networks and Applications
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Sparse Evolutionary Training
