Entropy-Based Data Selection for Language Models

Hongming Li; Yang Liu; Chao Huang

arXiv:2602.17465 · cs.CL · February 20, 2026

Open Access

TL;DR

This paper introduces EUDS, an entropy-based data selection framework that enhances fine-tuning efficiency of language models by reducing data and computational costs, validated across multiple NLP tasks.

Contribution

The paper presents a novel entropy-based unsupervised data selection method that improves fine-tuning efficiency under resource constraints, with theoretical and empirical validation.

Findings

01. EUDS significantly reduces computational costs.

02. EUDS improves training efficiency while using less data.

03. EUDS's effectiveness is validated across sentiment, topic, and Q&A tasks.

Abstract

Modern language models (LMs) increasingly require two critical resources: computational resources and data resources. Data selection techniques can effectively reduce the amount of training data required for fine-tuning LMs. However, their effectiveness is closely tied to computational resources, and they always require a high compute budget. Given the resource limitations of practical fine-tuning scenarios, we systematically reveal the relationship between data selection and the uncertainty estimation of the selected data. Although large language models (LLMs) exhibit exceptional capabilities in language understanding and generation, which provides new ways to alleviate data scarcity, evaluating data usability remains a challenging task. This makes efficient data selection indispensable. To mitigate these issues, we propose the Entropy-Based Unsupervised Data Selection (EUDS) framework. Empirical…
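The abstract ties data selection to uncertainty estimation but, as excerpted here, does not state the scoring rule EUDS uses. The sketch below is therefore only a generic illustration of entropy-based ranking, not the authors' method: it scores each unlabeled sample by the Shannon entropy of a model's predicted class distribution and keeps the top `budget` samples. The function names, the `highest_first` flag, and the toy probabilities are all assumptions made for illustration.

```python
import math
from typing import List, Sequence


def shannon_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a discrete probability distribution."""
    # Zero-probability entries contribute nothing to the sum.
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_by_entropy(
    prob_rows: Sequence[Sequence[float]],
    budget: int,
    highest_first: bool = True,
) -> List[int]:
    """Return the indices of `budget` samples ranked by predictive entropy.

    Assumption: ranking direction is a free choice here. Whether a real
    system should prefer high- or low-entropy samples is not stated in
    the abstract, so it is exposed as a hypothetical flag.
    """
    scores = [shannon_entropy(row) for row in prob_rows]
    order = sorted(
        range(len(scores)), key=lambda i: scores[i], reverse=highest_first
    )
    return order[:budget]


# Hypothetical predicted class probabilities for four unlabeled samples.
pool = [
    [0.98, 0.01, 0.01],  # confident prediction, low entropy
    [0.34, 0.33, 0.33],  # near-uniform prediction, high entropy
    [0.70, 0.20, 0.10],
    [0.50, 0.50, 0.00],
]

selected = select_by_entropy(pool, budget=2)
print(selected)  # prints [1, 2]
```

Because scoring needs only forward-pass probabilities and no labels, a ranking of this kind is unsupervised; the cost of obtaining those probabilities is the part that a compute-constrained setting would need to keep small.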

Taxonomy

Topics: Sentiment Analysis and Opinion Mining · Topic Modeling · Computational and Text Analysis Methods