BitNet: Scaling 1-bit Transformers for Large Language Models

Hongyu Wang; Shuming Ma; Li Dong; Shaohan Huang; Huaijie Wang; Lingxiao Ma; Fan Yang; Ruiping Wang; Yi Wu; Furu Wei

arXiv:2310.11453·cs.CL·October 18, 2023·26 cites

TL;DR

BitNet introduces a 1-bit Transformer architecture that enables training large language models with significantly reduced memory and energy use, while maintaining competitive performance and scalability.

Contribution

The paper presents BitLinear, a drop-in replacement for the nn.Linear layer that trains 1-bit weights from scratch, and shows that BitNet follows a scaling law akin to that of full-precision Transformers while substantially reducing memory footprint and energy consumption.

Findings

01 Achieves competitive language modeling performance with 1-bit weights.

02 Reduces memory footprint and energy consumption compared to 8-bit and FP16 models.

03 Exhibits a scaling law similar to full-precision Transformers.
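
The memory claim in finding 02 can be made concrete with back-of-the-envelope arithmetic. The sketch below counts storage for the weights alone at 16, 8, and 1 bits per weight. The function name and the 7-billion-parameter model size are hypothetical and not taken from the paper, and the estimate ignores activations, embeddings, optimizer state, and any scaling factors stored alongside the weights.

```python
def weight_memory_gib(num_params: int, bits_per_weight: int) -> float:
    """Storage needed for the weights alone, in GiB (2**30 bytes)."""
    return num_params * bits_per_weight / 8 / 2**30


if __name__ == "__main__":
    # Hypothetical model size chosen for illustration; not a figure from the paper.
    num_params = 7_000_000_000
    for label, bits in [("FP16", 16), ("8-bit", 8), ("1-bit", 1)]:
        print(f"{label}: {weight_memory_gib(num_params, bits):.2f} GiB")
```

Under this simplified accounting, 1-bit weights take one sixteenth of the space of FP16 weights and one eighth of the space of 8-bit weights. Real savings depend on which parts of the model are kept at higher precision.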

Abstract

The increasing size of large language models has posed challenges for deployment and raised concerns about environmental impact due to high energy consumption. In this work, we introduce BitNet, a scalable and stable 1-bit Transformer architecture designed for large language models. Specifically, we introduce BitLinear as a drop-in replacement of the nn.Linear layer in order to train 1-bit weights from scratch. Experimental results on language modeling show that BitNet achieves competitive performance while substantially reducing memory footprint and energy consumption, compared to state-of-the-art 8-bit quantization methods and FP16 Transformer baselines. Furthermore, BitNet exhibits a scaling law akin to full-precision Transformers, suggesting its potential for effective scaling to even larger language models while maintaining efficiency and performance benefits.
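
The abstract does not spell out how BitLinear maps weights to 1 bit, so the following is only a minimal sketch of the general idea, written in plain Python rather than as a real nn.Linear subclass. It assumes sign-based binarization of mean-centred weights and a single per-tensor scale equal to the mean absolute weight; both choices, and the names binarize_weights and bit_linear_forward, are assumptions made for illustration. Only the forward pass is shown: training such a layer from scratch also needs a way to pass gradients through the non-differentiable sign function, and the sketch omits bias terms and any activation quantization.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def binarize_weights(w: Matrix) -> Tuple[List[List[int]], float]:
    """Map a weight matrix to signs in {-1, +1} plus one float scale."""
    flat = [v for row in w for v in row]
    mean = sum(flat) / len(flat)
    # Assumption: centre the weights before taking the sign.
    signs = [[1 if v - mean > 0 else -1 for v in row] for row in w]
    # Assumption: one per-tensor scale, the mean absolute weight.
    scale = sum(abs(v) for v in flat) / len(flat)
    return signs, scale


def bit_linear_forward(x: List[float], w: Matrix) -> List[float]:
    """Forward pass y = scale * (sign(W) @ x) for a single input vector.

    w has shape (out_features, in_features), the layout nn.Linear uses.
    """
    signs, scale = binarize_weights(w)
    # With +/-1 weights each dot product reduces to additions and subtractions.
    return [scale * sum(s * xi for s, xi in zip(row, x)) for row in signs]


if __name__ == "__main__":
    weights = [[0.5, -0.5], [1.0, -1.0]]
    print(bit_linear_forward([2.0, 1.0], weights))
```

Keeping the scale outside the binarized matrix means the stored weights need only one bit each, while the layer output stays in roughly the same numeric range as its full-precision counterpart.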

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Speech Recognition and Synthesis

Methods: Multi-Head Attention · Attention Is All You Need · Dense Connections · Linear Layer · Softmax · Residual Connection · Absolute Position Encodings · Layer Normalization · Adam · Byte Pair Encoding