Scaling Laws for Precision

Tanishq Kumar; Zachary Ankner; Benjamin F. Spector; Blake Bordelon; Niklas Muennighoff; Mansheej Paul; Cengiz Pehlevan; Christopher Ré; Aditi Raghunathan

arXiv:2411.04330·cs.LG·December 3, 2024

TL;DR

This paper develops 'precision-aware' scaling laws that predict how low precision training and inference impact language model quality and cost, enabling better optimization of model size, data, and precision choices.

Contribution

It introduces unified scaling laws that account for the effects of low precision in both training and inference, fitted on a large set of pretraining runs and validated on larger models.

Findings

01

Training in lower precision reduces a model's effective parameter count, which makes the additional loss from low-precision training and post-training quantization predictable.

02

The degradation from post-training quantization grows as models are trained on more data, so additional pretraining data can eventually become actively harmful.

03

Training larger models in lower precision may be compute-optimal.

Abstract

Low precision training and inference affect both the quality and cost of language models, but current scaling laws do not account for this. In this work, we devise "precision-aware" scaling laws for both training and inference. We propose that training in lower precision reduces the model's "effective parameter count," allowing us to predict the additional loss incurred from training in low precision and post-train quantization. For inference, we find that the degradation introduced by post-training quantization increases as models are trained on more data, eventually making additional pretraining data actively harmful. For training, our scaling laws allow us to predict the loss of a model with different parts in different precisions, and suggest that training larger models in lower precision may be compute optimal. We unify the scaling laws for post and pretraining quantization to…
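The two ideas in the abstract, an "effective parameter count" that shrinks with training precision and a post-training quantization penalty that grows with data, can be sketched numerically. The sketch below assumes a Chinchilla-style loss in which the parameter count is replaced by an effective count that saturates exponentially in the number of bits, plus a quantization penalty that scales as a power law in tokens over parameters. These functional forms, every function name, and every constant are illustrative assumptions; none of the values are fitted coefficients from the paper.

```python
import math

# Hypothetical constants, chosen only so the sketch produces sensible numbers.
# They are NOT fitted values from the paper.
A, ALPHA = 400.0, 0.34      # parameter term of a Chinchilla-style loss
B, BETA = 410.0, 0.28       # data term
E = 1.7                     # irreducible loss
GAMMA_W = 2.5               # assumed sensitivity (in bits) of weights to training precision
C_T = 0.05                  # assumed scale of the post-training quantization penalty
GAMMA_D, GAMMA_N = 0.5, 0.5 # assumed exponents on tokens and parameters
GAMMA_POST = 1.0            # assumed sensitivity (in bits) to post-training precision


def effective_params(n_params: float, train_bits: float) -> float:
    """Assumed form: the usable parameter count saturates toward n_params as bits grow."""
    return n_params * (1.0 - math.exp(-train_bits / GAMMA_W))


def pretrain_loss(n_params: float, n_tokens: float, train_bits: float) -> float:
    """Chinchilla-style loss with the parameter count replaced by its effective value."""
    n_eff = effective_params(n_params, train_bits)
    return A * n_eff ** (-ALPHA) + B * n_tokens ** (-BETA) + E


def ptq_degradation(n_params: float, n_tokens: float, post_bits: float) -> float:
    """Assumed form: penalty grows with tokens, shrinks with parameters and with bits."""
    ratio = n_tokens ** GAMMA_D / n_params ** GAMMA_N
    return C_T * ratio * math.exp(-post_bits / GAMMA_POST)


def loss_after_ptq(n_params: float, n_tokens: float, post_bits: float,
                   train_bits: float = 16.0) -> float:
    """Loss of a model trained at train_bits, then quantized to post_bits."""
    return pretrain_loss(n_params, n_tokens, train_bits) + ptq_degradation(
        n_params, n_tokens, post_bits
    )


if __name__ == "__main__":
    n = 1e8  # hypothetical model size
    for d in (1e9, 1e11, 1e13):
        print(f"tokens={d:.0e}  loss after 3-bit PTQ={loss_after_ptq(n, d, 3):.3f}")
```

Under these assumed constants, the loss after aggressive post-training quantization first falls and then rises as the token count grows, which is the qualitative behavior the abstract describes. With different constants the turning point moves or disappears within a practical range, so the sketch shows the mechanism only, not the paper's quantitative predictions.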

Peer Reviews

Decision·ICLR 2025 Oral

Reviewer 01 · Rating 8 · Confidence 3

Strengths

1. The paper tackles an important issue with the introduction of a bit precision scaling law. While this topic has been explored before, the theoretical scaling law presented in this work offers valuable guidance for the efficient deployment of models in real-world applications. The implications of this work could be transformative for the field. 2. The authors have provided a wealth of experimental results that not only validate the existing scaling laws across different model size

Weaknesses

No clear weaknesses.

Reviewer 02 · Rating 8 · Confidence 3

Strengths

(1) The paper studies a meaningful topic, the scaling laws of precision, which is a new topic following the scaling law of data and parameters. (2) The paper gives a good presentation. I especially appreciate the introduction to quantization. I'm not familiar with how quantization works in detail, so it helps a lot. (3) The paper shows interesting findings in Sec. 3.1 Fig. 2: more pretraining tokens result in lower performance for post-train quantization with a high quantization rate. (4) The

Weaknesses

(1) The paper uses the Dolma dataset for its experiments. Though it would be hard to do, it would be interesting to see how the choice of pretraining data affects this law. (2) The paper uses OLMo-style models for its experiments. It would be helpful to give a general introduction to the OLMo-style architecture. Are these transformer-based models? While the abstract states the scaling law for language models in general, there are other types of language models besides OLMo-style ones, such as SSMs.

Reviewer 03 · Rating 8 · Confidence 3

Strengths

- The paper introduces a new dimension to the well-established scaling laws by incorporating precision as a critical factor. This is an important contribution because most prior work focused on model size and dataset size without considering precision, which is increasingly relevant due to hardware advancements supporting lower-precision computations. By doing so, the authors offer a more comprehensive framework for understanding and optimizing model performance under different training and infe

Weaknesses

- While the paper focuses extensively on integer-type precisions (e.g., 3-bit, 8-bit), it does not explore floating-point types like FP8 or BF16 in as much depth. Given that floating-point formats are widely used in modern hardware, this omission limits the generalizability of the findings to real-world applications where floating-point precision is common. This could limit the applicability of the scaling laws in environments where floating-point precision dominates, potentially requiring furt

Code & Models

Repositories

IST-DASLab/QuEST
pytorch

Taxonomy

Topics: Philosophy and History of Science