Measuring Sample Importance in Data Pruning for Language Models based on Information Entropy

Minsang Kim; Seungjun Baek

arXiv:2406.14124 · cs.AI · December 13, 2024

Open Access

TL;DR

This paper introduces a data pruning method for language models based on information entropy, aiming to improve training efficiency and model generalization by removing less informative samples.

Contribution

The paper proposes an entropy-based method for ranking training samples by informativeness, so that the least informative samples can be pruned first when training language models.

Findings

01. Entropy-based pruning improves model generalization

02. Reduces training data without sacrificing accuracy

03. Enhances training efficiency for large language models

Abstract

Compute-efficient training of language models has become an important issue. We consider data pruning for data-efficient training of LLMs, using a method based on information entropy. We propose that the samples in the training corpus be ranked by their informativeness, which we estimate through entropy functions. The key idea is that less informative samples are likely to contain redundant information and thus should be pruned first. As surrogates for a sample's informativeness, we use entropy functions based on its negative log-likelihood and its average inverse word frequency. Experiments reveal that the proposed information-based pruning can improve performance on various language modeling and downstream tasks and enhance the generalization capability of language models.
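The ranking idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give the exact formulas, so the two scoring functions below are simplified stand-ins. In particular, `avg_unigram_nll` uses a unigram model estimated from the corpus itself in place of a trained language model, and all function names (`word_counts`, `avg_inverse_word_frequency`, `avg_unigram_nll`, `prune`) are hypothetical.

```python
import math
from collections import Counter


def word_counts(corpus):
    """Count whitespace-separated words over the whole corpus."""
    counts = Counter()
    for sample in corpus:
        counts.update(sample.split())
    return counts


def avg_inverse_word_frequency(sample, counts, total):
    """Mean of 1 / relative frequency over the words of a sample.

    Assumes every word of `sample` occurs in `counts` (the sample is
    drawn from the corpus the counts were built from).
    """
    words = sample.split()
    if not words:
        return 0.0
    return sum(total / counts[w] for w in words) / len(words)


def avg_unigram_nll(sample, counts, total):
    """Mean negative log-likelihood per word under a unigram model.

    Assumption: a unigram model stands in for the language model whose
    negative log-likelihood the paper refers to.
    """
    words = sample.split()
    if not words:
        return 0.0
    return -sum(math.log(counts[w] / total) for w in words) / len(words)


def prune(corpus, score_fn, keep_ratio):
    """Rank samples by score (most informative first) and keep the top share.

    Low-scoring samples are treated as less informative and dropped first.
    """
    counts = word_counts(corpus)
    total = sum(counts.values())
    ranked = sorted(corpus, key=lambda s: score_fn(s, counts, total), reverse=True)
    n_keep = max(1, math.ceil(keep_ratio * len(corpus)))
    return ranked[:n_keep]


if __name__ == "__main__":
    corpus = [
        "the cat sat",
        "the cat sat",
        "the the the",
        "quantum entropy pruning",
    ]
    print(prune(corpus, avg_inverse_word_frequency, keep_ratio=0.5))
    print(prune(corpus, avg_unigram_nll, keep_ratio=0.5))
```

In this toy corpus, the sample made only of the most frequent word scores lowest under both surrogates and is pruned first, while the sample made of rare words is ranked highest. A real pipeline would replace the unigram estimate with per-token likelihoods from an actual language model and tokenize with that model's tokenizer.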

Taxonomy

Topics: Online Learning and Analytics · Educational Technology and Assessment · Open Education and E-Learning

Methods: Pruning