Entropy Law: The Story Behind Data Compression and LLM Performance
Mingjia Yin, Chuhan Wu, Yufei Wang, Hao Wang, Wei Guo, Yasheng Wang, Yong Liu, Ruiming Tang, Defu Lian, Enhong Chen

TL;DR
This paper introduces an entropy law linking LLM performance to data compression ratio and first-epoch training loss, proposes a universal data selection method, ZIP, that improves training efficiency, and shows that the law can flag potential performance risks early in training.
Contribution
It uncovers an entropy law connecting data redundancy and model performance, and develops a novel data selection method based on this law for more efficient LLM training.
Findings
Model performance negatively correlates with data compression ratio.
The ZIP method outperforms existing data selection techniques.
The entropy law can detect performance risks early.
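To make the first finding concrete, here is a minimal sketch of how a dataset's compression ratio might be measured. It assumes the ratio is defined as raw byte size divided by compressed byte size and uses Python's standard zlib module as the compressor; both choices, the function name compression_ratio, and the toy datasets are illustrative assumptions, not details confirmed by this summary.

```python
import zlib


def compression_ratio(samples):
    """Raw size / compressed size of the joined samples (assumed definition).

    A higher ratio means the text is more redundant; a lower ratio
    means it carries more information per byte.
    """
    raw = "\n".join(samples).encode("utf-8")
    if not raw:
        raise ValueError("samples must contain some text")
    # Level 9 asks zlib for its strongest compression setting.
    compressed = zlib.compress(raw, 9)
    return len(raw) / len(compressed)


# Hypothetical toy datasets: one highly repetitive, one more varied.
redundant = ["The cat sat on the mat."] * 50
diverse = [
    f"Sample {i}: value {i * 7919 % 1013} differs from {i * 104729 % 997}."
    for i in range(50)
]

print("redundant:", round(compression_ratio(redundant), 2))
print("diverse:  ", round(compression_ratio(diverse), 2))
```

Under this assumed definition, the repetitive set compresses far better and so gets the higher ratio, which per the finding above would be the less favorable training set.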
Abstract
Data is the cornerstone of large language models (LLMs), but not all data is useful for model learning. Carefully selected data can better elicit the capabilities of LLMs with much less computational overhead. Most methods concentrate on evaluating the quality of individual samples in data selection, while the combinatorial effects among samples are neglected. Even if each sample is of perfect quality, their combinations may be suboptimal in teaching LLMs due to their intrinsic homogeneity or contradiction. In this paper, we aim to uncover the underlying relationships between LLM performance and data selection. Inspired by the information compression nature of LLMs, we uncover an "entropy law" that connects LLM performance with data compression ratio and first-epoch training loss, which reflect the information redundancy of a dataset and the mastery of inherent knowledge encoded in…
Taxonomy
Topics: Financial Distress and Bankruptcy Prediction · Advanced Data Storage Technologies · Auction Theory and Applications
