Measuring Sample Importance in Data Pruning for Language Models based on Information Entropy
Minsang Kim, Seungjun Baek

TL;DR
This paper introduces a data pruning method for language models based on information entropy, aiming to improve training efficiency and model generalization by removing less informative samples.
Contribution
It proposes ranking the samples in a training corpus by entropy-based informativeness, estimated from negative log-likelihood and average inverse word frequency, and pruning the least informative samples first.
Findings
Entropy-based pruning improves model generalization
Reduces training data without sacrificing accuracy
Enhances training efficiency for large language models
Abstract
Compute-efficient training of language models has become an important issue. We consider data pruning for data-efficient training of LLMs, and in particular a pruning method based on information entropy. We propose that the samples in the training corpus be ranked by their informativeness, which we estimate through entropy functions. The key idea is that less informative samples are likely to contain redundant information and thus should be pruned first. As surrogates for a sample's informativeness, we use entropy functions based on its negative log-likelihood and its average inverse word frequency. Experiments reveal that the proposed information-based pruning can improve performance on various language modeling and downstream tasks, and enhance the generalization capability of language models.
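The ranking idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say which model supplies the likelihoods, so a unigram model fitted on the corpus itself stands in here as an assumption, and the names `unigram_probs`, `avg_nll`, `avg_inverse_word_freq`, and `prune` are hypothetical. Samples are scored, sorted from most to least informative, and the lowest-scoring ones are dropped.

```python
import math
from collections import Counter


def tokenize(text):
    # Whitespace tokenization; an assumption made for brevity.
    return text.lower().split()


def unigram_probs(corpus):
    # Relative frequency of each word over the whole corpus.
    counts = Counter(tok for sample in corpus for tok in tokenize(sample))
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def avg_nll(sample, probs):
    # Average negative log-likelihood per token under the unigram model
    # (a stand-in for the likelihood of a real language model).
    toks = tokenize(sample)
    if not toks:
        return 0.0
    return -sum(math.log(probs[t]) for t in toks) / len(toks)


def avg_inverse_word_freq(sample, probs):
    # Average of 1 / (relative word frequency) over the sample's tokens.
    toks = tokenize(sample)
    if not toks:
        return 0.0
    return sum(1.0 / probs[t] for t in toks) / len(toks)


def prune(corpus, score_fn, keep_ratio):
    # Keep the highest-scoring (most informative) fraction of the corpus.
    probs = unigram_probs(corpus)
    ranked = sorted(corpus, key=lambda s: score_fn(s, probs), reverse=True)
    n_keep = max(1, round(len(corpus) * keep_ratio))
    return ranked[:n_keep]


corpus = [
    "the cat sat",
    "the cat sat",
    "the the the",
    "quantum entropy prunes redundancy",
]
kept_nll = prune(corpus, avg_nll, keep_ratio=0.5)
kept_iwf = prune(corpus, avg_inverse_word_freq, keep_ratio=0.5)
```

In this toy corpus, the sample made only of the most frequent word scores lowest under both surrogates and is pruned first, while the sample made of rare words is retained.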
Taxonomy
Topics: Online Learning and Analytics · Educational Technology and Assessment · Open Education and E-Learning
Methods: Pruning
