1-bit AI Infra: Part 1.1, Fast and Lossless BitNet b1.58 Inference on CPUs

Jinheng Wang; Hansong Zhou; Ting Song; Shaoguang Mao; Shuming Ma; Hongyu Wang; Yan Xia; Furu Wei

arXiv:2410.16144 · cs.CL · October 24, 2024

TL;DR

This paper introduces bitnet.cpp, a software stack for fast, lossless inference of 1-bit LLMs (ternary BitNet b1.58) on CPUs. It reports speedups of 2.37x to 6.17x on x86 CPUs and 1.37x to 5.07x on ARM CPUs, which makes local deployment of large language models more practical.

Contribution

The work develops a set of kernels, packaged as the bitnet.cpp software stack, that support fast and lossless inference of ternary BitNet b1.58 LLMs on CPUs.

Findings

01 Speedups of 2.37x to 6.17x on x86 CPUs

02 Speedups of 1.37x to 5.07x on ARM CPUs

03 Speedups are reported across various model sizes

Abstract

Recent advances in 1-bit Large Language Models (LLMs), such as BitNet and BitNet b1.58, present a promising approach to enhancing the efficiency of LLMs in terms of speed and energy consumption. These developments also enable local LLM deployment across a broad range of devices. In this work, we introduce bitnet.cpp, a tailored software stack designed to unlock the full potential of 1-bit LLMs. Specifically, we develop a set of kernels to support fast and lossless inference of ternary BitNet b1.58 LLMs on CPUs. Extensive experiments demonstrate that bitnet.cpp achieves significant speedups, ranging from 2.37x to 6.17x on x86 CPUs and from 1.37x to 5.07x on ARM CPUs, across various model sizes. The code is available at https://github.com/microsoft/BitNet.
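The abstract refers to ternary weights and lossless inference without giving kernel details. As a rough illustration only, the sketch below shows why ternary weights suit CPUs: each weight in {-1, 0, +1} fits in 2 bits, and a dot product needs only additions and subtractions. This is not the bitnet.cpp kernel. The function names and the 4-weights-per-byte layout are hypothetical, and the use of integer activations is an assumption made so the arithmetic is exact.

```python
# Illustrative sketch only; NOT the bitnet.cpp implementation.
# Assumptions: weights are ternary (-1, 0, +1), activations are integers,
# and the 2-bit packing layout (4 weights per byte) is hypothetical.

def pack_ternary(weights):
    """Pack ternary weights into bytes, 2 bits per weight."""
    codes = [w + 1 for w in weights]  # map -1, 0, +1 to 0, 1, 2
    packed = bytearray()
    for i in range(0, len(codes), 4):
        byte = 0
        for j, code in enumerate(codes[i:i + 4]):
            byte |= code << (2 * j)
        packed.append(byte)
    return bytes(packed)


def unpack_ternary(packed, n):
    """Recover the first n ternary weights from the packed bytes."""
    out = []
    for byte in packed:
        for j in range(4):
            out.append(((byte >> (2 * j)) & 0b11) - 1)
    return out[:n]  # drop padding slots in the last byte


def ternary_dot(weights, activations):
    """Dot product using only additions and subtractions."""
    acc = 0
    for w, x in zip(weights, activations):
        if w == 1:
            acc += x
        elif w == -1:
            acc -= x
        # w == 0 contributes nothing and is skipped
    return acc


weights = [1, -1, 0, 1, 0, -1]
activations = [3, 5, 7, 2, 9, 4]

packed = pack_ternary(weights)
restored = unpack_ternary(packed, len(weights))
result = ternary_dot(restored, activations)
```

In this toy setting the packed weights round-trip exactly and the add/subtract result matches an ordinary multiply-accumulate, which is one simple reading of "lossless". The actual kernels and their storage formats are in the repository linked in the abstract.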

Code & Models

Repositories

microsoft/bitnet
pytorch · Official

Taxonomy

Topics: Brain Tumor Detection and Classification · Neural Networks and Applications

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Sparse Evolutionary Training