Beyond Scaling: Measuring and Predicting the Upper Bound of Knowledge Retention in Language Model Pre-Training
Changhao Jiang, Ming Zhang, Yifei Cao, Junjie Ye, Xiaoran Fan, Shihan Dou, Zhiheng Xi, Jiajun Sun, Yi Dong, Yujiong Shen, Jingqi Tong, Baoyu Fan, Tao Gui, Qi Zhang, Xuanjing Huang

TL;DR
This paper introduces Size-dependent Mutual Information (SMI), a method to predict a language model's knowledge retention capacity from pre-training signals, revealing an intrinsic upper bound and guiding future model scaling and augmentation strategies.
Contribution
The work presents SMI, a novel information-theoretic predictor for estimating knowledge retention in language models before training, validated across large-scale datasets and models.
Findings
SMI outperforms baselines in predicting QA accuracy
Diminishing returns from scaling data and model size
Evidence of an intrinsic upper bound on knowledge retention
Abstract
The GPT-4 technical report suggests that downstream performance can be predicted from pre-training signals, but offers little methodological detail on how to quantify this. This work address this gap by modeling knowledge retention, the capacity of a pre-trained language model to memorize factual information from its corpus, and introduce a principled method to estimate it prior to training. We propose Size-dependent Mutual Information (SMI), an information-theoretic predictor that integrates knowledge frequency, knowledge specificity, and model size to forecast closed-book question answering (QA) accuracy. SMI is validated through large-scale document retrieval over the disclosed pre-training corpora of 21 public and 3 custom models, combined with a robust multi-template QA evaluation. Experiments show that SMI significantly outperforms repetition-based baselines and achieves >…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsTopic Modeling · Natural Language Processing Techniques · Data Quality and Management
MethodsAttention Is All You Need · Linear Layer · Multi-Head Attention · Dense Connections · Dropout · Layer Normalization · Byte Pair Encoding · Softmax · Absolute Position Encodings · Residual Connection
