Continual Quantization-Aware Pre-Training: When to transition from 16-bit to 1.58-bit pre-training for BitNet language models?

Jacob Nielsen; Peter Schneider-Kamp; Lukas Galke

arXiv:2502.11895·cs.LG·February 18, 2025

Open Access

TL;DR

This paper proposes a training strategy for large language models that starts pre-training in 16-bit precision and then transitions to 1.58-bit quantization-aware training, with the aim of reducing resource requirements while maintaining accuracy on 11 downstream tasks.

Contribution

It introduces a 16-to-1.58-bit quantization-aware pre-training strategy and reports that it is preferable to full 1.58-bit training in terms of downstream accuracy.

Findings

01. The 16-to-1.58-bit strategy outperforms full 1.58-bit training.

02. Retaining optimizer state reduces loss spikes during transition.

03. Gradual quantization helps maintain model performance.
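
One way to picture the gradual quantization mentioned in the findings is as a blend between the full-precision weights and their quantized counterparts, with a strength that ramps from 0 to 1 around the transition point. The sketch below is illustrative only: the linear ramp, the function names `quantization_strength` and `blend`, and the parameters `start` and `ramp_steps` are assumptions made for this example, not the schedule used in the paper.

```python
def quantization_strength(step, start, ramp_steps):
    """Hypothetical linear ramp: 0.0 before `start`, 1.0 from `start + ramp_steps` on."""
    if ramp_steps <= 0:
        # Degenerate case: an abrupt switch at `start`.
        return 1.0 if step >= start else 0.0
    return min(1.0, max(0.0, (step - start) / ramp_steps))


def blend(full_precision, quantized, lam):
    """Interpolate element-wise between full-precision and quantized weights.

    lam = 0.0 returns the full-precision weights, lam = 1.0 the quantized ones.
    """
    return [(1.0 - lam) * w + lam * q for w, q in zip(full_precision, quantized)]


# Example: halfway through a 50-step ramp that starts at step 100.
lam = quantization_strength(step=125, start=100, ramp_steps=50)
mixed = blend([1.0, -2.0], [0.0, -1.0], lam)
```

With `ramp_steps=0` the same helper describes an abrupt switch, which makes it easy to compare the two transition styles in one training loop.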

Abstract

Large language models (LLMs) require immense resources for training and inference. Quantization, a technique that reduces the precision of model parameters, offers a promising solution for improving LLM efficiency and sustainability. While post-training quantization methods typically achieve 4-8 bits per parameter, recent research suggests that training LLMs with 1.58 bits per weight parameter from scratch can maintain model accuracy while greatly reducing memory requirements and energy consumption at inference time. Here, we investigate a training strategy for quantization-aware pre-training, where the models are first trained with 16-bit precision and then transition into 1.58-bit quantization-aware training. Our results on 11 downstream tasks show that this 16-to-1.58-bit training strategy is preferable over full 1.58-bit training and leaves models closer to those which have…
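
To make the "1.58 bits" figure concrete: a weight restricted to the three values -1, 0, and +1 carries log2(3) ≈ 1.585 bits of information. The following minimal sketch shows a ternary quantizer with a shared scale and a hard switch from full-precision to quantized weights at a chosen step. It assumes the absmean rounding scheme commonly associated with BitNet b1.58; the abstract does not spell out the quantizer, and the names `ternary_quantize`, `effective_weights`, and `transition_step` are hypothetical. A real quantization-aware training loop would also need a gradient rule for the rounding step (for example a straight-through estimator), which is omitted here.

```python
import math

# Information content of one ternary weight: log2(3), about 1.585 bits.
BITS_PER_TERNARY_WEIGHT = math.log2(3)


def ternary_quantize(weights, eps=1e-8):
    """Map weights to codes in {-1, 0, +1} plus one shared scale.

    Assumption: absmean scaling, i.e. the scale is the mean absolute weight.
    """
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return codes, scale


def effective_weights(weights, step, transition_step):
    """Weights used in the forward pass.

    Before `transition_step` the full-precision weights are used unchanged
    (the 16-bit phase); from that step on, their ternary approximation is used.
    """
    if step < transition_step:
        return list(weights)
    codes, scale = ternary_quantize(weights)
    return [c * scale for c in codes]


weights = [0.9, -0.05, -1.2, 0.4]
codes, scale = ternary_quantize(weights)  # codes: [1, 0, -1, 1]
```

In this sketch the full-precision weights are kept throughout and only the forward pass changes at the transition, which is the usual arrangement in quantization-aware training.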

Taxonomy

Topics: Neural Networks and Applications · Machine Learning and Data Classification · Speech Recognition and Synthesis