Physics of Language Models: Part 3.3, Knowledge Capacity Scaling Laws

Zeyuan Allen-Zhu; Yuanzhi Li

arXiv:2404.05405·cs.CL·April 9, 2024·3 cites

TL;DR

This paper investigates how language models store factual knowledge, establishing that they can store approximately 2 bits per parameter, and explores factors affecting this capacity, including architecture, training, and data signals.

Contribution

It introduces a method to estimate knowledge capacity in language models and provides new insights into how architecture and training influence knowledge storage.

Findings

01

Language models store about 2 bits of knowledge per parameter.

02

A 7B model can store more knowledge than English Wikipedia and textbooks combined.

03

Prepending domain names to training data enhances knowledge capacity.
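As a rough illustration of the third finding, the sketch below prepends a source domain to each training document before it is fed to the model. The function name `prepend_domain`, the space separator, and the sample corpus are hypothetical choices made for this example; the summary above does not specify the exact tag format.

```python
# Minimal sketch of domain tagging, assuming each document's source domain is known.
# The tag format (domain, then a space, then the text) is an assumption for illustration.

def prepend_domain(text: str, domain: str) -> str:
    """Return the document text with its source domain placed in front."""
    return f"{domain} {text}"

# Hypothetical (domain, text) pairs standing in for a pretraining corpus.
corpus = [
    ("wikipedia.org", "The capital of the USA is Washington D.C."),
    ("example.com", "Some lower-quality web text."),
]

tagged = [prepend_domain(text, domain) for domain, text in corpus]

for line in tagged:
    print(line)
```

The idea is that the tag lets the model associate each document with its source while it trains.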

Abstract

Scaling laws describe the relationship between the size of language models and their capabilities. Unlike prior studies that evaluate a model's capability via loss or benchmarks, we estimate the number of knowledge bits a model stores. We focus on factual knowledge represented as tuples, such as (USA, capital, Washington D.C.) from a Wikipedia page. Through multiple controlled datasets, we establish that language models can and only can store 2 bits of knowledge per parameter, even when quantized to int8, and such knowledge can be flexibly extracted for downstream applications. Consequently, a 7B model can store 14B bits of knowledge, surpassing the English Wikipedia and textbooks combined based on our estimation. More broadly, we present 12 results on how (1) training duration, (2) model architecture, (3) quantization, (4) sparsity constraints such as MoE, and (5) data…
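The arithmetic behind the 7B example can be sketched as follows. The 2 bits per parameter figure is the paper's estimate; the helper names `knowledge_capacity_bits` and `params_needed` are hypothetical and introduced only for this illustration.

```python
# Back-of-the-envelope capacity estimate, assuming the 2 bits/parameter figure from the abstract.
BITS_PER_PARAM = 2.0

def knowledge_capacity_bits(num_params: float, bits_per_param: float = BITS_PER_PARAM) -> float:
    """Estimated knowledge capacity in bits for a model with num_params parameters."""
    return num_params * bits_per_param

def params_needed(knowledge_bits: float, bits_per_param: float = BITS_PER_PARAM) -> float:
    """Estimated parameter count needed to store knowledge_bits bits."""
    return knowledge_bits / bits_per_param

capacity = knowledge_capacity_bits(7e9)   # 7B parameters
print(f"{capacity:.2e} bits")             # 1.40e+10 bits, i.e. 14B bits
print(f"{capacity / 8 / 1e9:.2f} GB")     # 1.75 GB of raw information
```

Inverting the relation gives a rough parameter budget for a target amount of knowledge, under the same assumption.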

Peer Reviews

Decision·ICLR 2025 Spotlight

Reviewer 01 · Rating 5 · Confidence 3

Strengths

• Originality: Introducing a framework to measure language model capacity in bits per parameter is a novel approach that adds a quantitative dimension to model evaluation.
• Methodology: The use of controlled synthetic datasets allows for the isolation of specific variables, providing clarity in the analysis of different factors affecting knowledge capacity.

Weaknesses

• Formatting Issues: The absence of a conclusion section and unclear figures detract from the overall quality of the paper and impede the reader's understanding.
• Generalization to Real-world Data: The heavy reliance on synthetic data limits the applicability of the findings to natural language processing tasks involving complex and diverse datasets.
• Incomplete Exploration of Quantization: The paper does not investigate quantization during training, which could provide insights into mitigat

Reviewer 02 · Rating 6 · Confidence 3

Strengths

1. The authors propose a novel method to measure the knowledge stored in an LLM by a bit complexity lower bound, and show that the amount of information stored in a single parameter is approximately 2 bits.
2. The authors investigate various factors influencing the knowledge storage capacity of LLMs, which provides many practical insights for LLM training.

Weaknesses

1. What do the long plateaus in the stored information mean? Does the maximum information reached by the plateaus equal the bit complexity upper bound? The main scaling law only describes the linearly increasing part, not the plateau part.
2. The scaling law seems to be incomplete: the paper describes the factors influencing the stored knowledge separately. Is there an empirical law that can describe and summarize all the results?
3. What does it mean if we say that the LLM stores N

Reviewer 03 · Rating 8 · Confidence 4

Strengths

- Scaling laws are of interest to the community as they allow us to analyze and understand the capacity of large language models. They also enable the design of new models and pretraining experiments. The authors claim they are the first to propose a scaling law for the knowledge capacity of LLMs.
- The paper contains a very thorough and exhaustive list of experiments, answering several questions.
- There are practical recommendations for LLM model builders, such as domain tagging for perta

Weaknesses

- The paper is very dense and the appendix is very long. As a result, it is not easy to absorb all the important details.
- The graphs are very small and are not explained adequately.

Code & Models

Models

fzmnm/TinyStoriesAdv_v2_92M
model · 3 dl · ♡ 1

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling

Methods: Attention Is All You Need · Cosine Annealing · Discriminative Fine-Tuning · Softmax · Linear Layer · Layer Normalization · Weight Decay · Dense Connections · Attention Dropout