Memorization-Compression Cycles Improve Generalization
Fangyuan Yu

TL;DR
This paper demonstrates that internal representation compression during training enhances language model generalization, introduces a new objective (IBLM), and proposes a training algorithm (GAPT) that leverages memorization-compression cycles for better performance.
Contribution
It introduces the IBLM objective and the GAPT training algorithm, revealing the importance of memorization-compression cycles for improved generalization in language models.
Findings
GAPT reduces representation entropy by 50%
GAPT improves cross-entropy by 4.8%
GAPT enhances out-of-distribution generalization by 35%
Abstract
We prove theoretically that generalization improves not only through data scaling but also by compressing internal representations. To operationalize this insight, we introduce the Information Bottleneck Language Modeling (IBLM) objective, which reframes language modeling as a constrained optimization problem: minimizing representation entropy subject to optimal prediction performance. Empirically, we observe an emergent memorization-compression cycle during LLM pretraining, evidenced by an oscillation between positive and negative gradient alignment between cross-entropy and Matrix-Based Entropy (MBE), a measure of representation entropy. This pattern closely mirrors the predictive-compressive trade-off prescribed by IBLM and also parallels the biological alternation between awake learning and sleep consolidation. Motivated by this observation, we propose Gated Phase Transition (GAPT), a training…
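To make the two quantities named in the abstract concrete, here is a minimal sketch rather than the paper's implementation. It assumes the common matrix-based Rényi formulation of MBE, computed from the eigenvalues of a trace-normalized Gram matrix of token representations, and it assumes "gradient alignment" means cosine similarity between two flattened gradient vectors. The function names, the linear kernel, and the default `alpha=2.0` are illustrative choices that this page does not specify.

```python
import numpy as np


def matrix_based_entropy(reps, alpha=2.0):
    """Renyi alpha-order entropy of the trace-normalized Gram matrix.

    reps: array of shape (n_tokens, hidden_dim) holding representations.
    The linear kernel and alpha=2.0 are assumptions, not taken from the paper.
    """
    reps = np.asarray(reps, dtype=np.float64)
    gram = reps @ reps.T                 # (n_tokens, n_tokens) similarity matrix
    gram = gram / np.trace(gram)         # normalize so the eigenvalues sum to 1
    eigvals = np.linalg.eigvalsh(gram)   # symmetric matrix, so real eigenvalues
    eigvals = np.clip(eigvals, 0.0, None)  # remove tiny negative round-off
    if np.isclose(alpha, 1.0):
        # The alpha -> 1 limit is the Shannon (von Neumann) entropy.
        nonzero = eigvals[eigvals > 1e-12]
        return float(-np.sum(nonzero * np.log(nonzero)))
    return float(np.log(np.sum(eigvals ** alpha)) / (1.0 - alpha))


def gradient_alignment(grad_a, grad_b):
    """Cosine similarity between two gradients, flattened to vectors."""
    a = np.ravel(np.asarray(grad_a, dtype=np.float64))
    b = np.ravel(np.asarray(grad_b, dtype=np.float64))
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    spread = rng.normal(size=(8, 16))                        # varied representations
    collapsed = np.tile(rng.normal(size=(1, 16)), (8, 1))    # identical rows
    print("spread MBE:   ", matrix_based_entropy(spread))
    print("collapsed MBE:", matrix_based_entropy(collapsed))
    print("alignment:    ", gradient_alignment([1.0, 0.0], [-1.0, 0.0]))
```

Under these assumptions, identical representations give an entropy of zero and mutually orthogonal ones give the maximum, log(n_tokens). The sign of `gradient_alignment` applied to the cross-entropy gradient and the MBE gradient is the quantity the abstract describes as oscillating during pretraining.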
Taxonomy
Methods: Attention Is All You Need · Dropout · Layer Normalization · Byte Pair Encoding · Attention Dropout · Softmax · Residual Connection · Linear Layer · Weight Decay
