Beyond Scaling: Measuring and Predicting the Upper Bound of Knowledge Retention in Language Model Pre-Training

Changhao Jiang; Ming Zhang; Yifei Cao; Junjie Ye; Xiaoran Fan; Shihan Dou; Zhiheng Xi; Jiajun Sun; Yi Dong; Yujiong Shen; Jingqi Tong; Baoyu Fan; Tao Gui; Qi Zhang; Xuanjing Huang

arXiv:2502.04066·cs.CL·January 8, 2026

TL;DR

This paper introduces Size-dependent Mutual Information (SMI), a method to predict a language model's knowledge retention capacity from pre-training signals, revealing an intrinsic upper bound and guiding future model scaling and augmentation strategies.

Contribution

The work presents SMI, a novel information-theoretic predictor for estimating knowledge retention in language models before training, validated across large-scale datasets and models.

Findings

01. SMI outperforms baselines in predicting QA accuracy

02. Diminishing returns from scaling data and model size

03. Evidence of an intrinsic upper bound on knowledge retention

Abstract

The GPT-4 technical report suggests that downstream performance can be predicted from pre-training signals, but offers little methodological detail on how to quantify this. This work addresses this gap by modeling knowledge retention, the capacity of a pre-trained language model to memorize factual information from its corpus, and introduces a principled method to estimate it prior to training. We propose Size-dependent Mutual Information (SMI), an information-theoretic predictor that integrates knowledge frequency, knowledge specificity, and model size to forecast closed-book question answering (QA) accuracy. SMI is validated through large-scale document retrieval over the disclosed pre-training corpora of 21 public and 3 custom models, combined with a robust multi-template QA evaluation. Experiments show that SMI significantly outperforms repetition-based baselines and achieves $R^{2}$ >…
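
The abstract reports predictor quality as an $R^{2}$ value against closed-book QA accuracy. As a rough illustration of that metric only, the sketch below computes the coefficient of determination of a least-squares line fitted between a generic predictor score and measured accuracy. It is not the SMI formula, which this page does not give. The function name `r_squared` and the toy numbers are assumptions made up for the example, not values from the paper.

```python
from statistics import mean


def r_squared(predictor, accuracy):
    """R^2 of the least-squares line accuracy ~ slope * predictor + intercept.

    Hypothetical helper for illustration; not the authors' evaluation code.
    """
    if len(predictor) != len(accuracy) or len(predictor) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")

    x_bar, y_bar = mean(predictor), mean(accuracy)
    sxx = sum((x - x_bar) ** 2 for x in predictor)
    ss_tot = sum((y - y_bar) ** 2 for y in accuracy)
    if sxx == 0 or ss_tot == 0:
        raise ValueError("predictor and accuracy must each vary")

    # Closed-form simple linear regression.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(predictor, accuracy))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar

    # Residual sum of squares of the fitted line.
    ss_res = sum(
        (y - (slope * x + intercept)) ** 2 for x, y in zip(predictor, accuracy)
    )
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    # Toy, made-up values: one predictor score and one QA accuracy per model.
    toy_scores = [0.1, 0.2, 0.3, 0.4]
    toy_accuracy = [0.2, 0.1, 0.4, 0.3]
    print(f"R^2 = {r_squared(toy_scores, toy_accuracy):.2f}")
```

An $R^{2}$ near 1 would mean the predictor explains most of the variance in accuracy under a linear fit. Whether the paper fits a linear or some other functional form is not stated in the truncated abstract.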

Code & Models

Repositories

yuhui1038/smi (Official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Data Quality and Management

Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Dense Connections · Dropout · Layer Normalization · Byte Pair Encoding · Softmax · Absolute Position Encodings · Residual Connection