Entropy-Based Data Selection for Language Models
Hongming Li, Yang Liu, Chao Huang

TL;DR
This paper introduces EUDS, an entropy-based data selection framework that enhances fine-tuning efficiency of language models by reducing data and computational costs, validated across multiple NLP tasks.
Contribution
The paper presents a novel entropy-based unsupervised data selection method that improves fine-tuning efficiency under resource constraints, with theoretical and empirical validation.
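To make the idea of entropy-based selection concrete, here is a minimal sketch, not the authors' EUDS implementation. This summary does not say how EUDS estimates entropy or which samples it keeps, so everything below is an assumption for illustration: entropy is taken to be the Shannon entropy of a sample's whitespace-token distribution, and the hypothetical helpers `token_entropy` and `select_by_entropy` (with the `fraction` and `highest` parameters) simply rank samples by that score and keep a fixed share.

```python
import math
from collections import Counter


def token_entropy(text):
    """Shannon entropy (in bits) of the whitespace-token distribution of `text`.

    Assumption: the paper's actual uncertainty estimate may differ.
    """
    tokens = text.split()
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def select_by_entropy(texts, fraction=0.5, highest=True):
    """Keep a `fraction` of `texts`, ranked by token entropy.

    Assumption: whether high- or low-entropy samples should be kept is not
    stated in this summary, so the direction is exposed via `highest`.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    k = max(1, math.ceil(fraction * len(texts)))
    ranked = sorted(texts, key=token_entropy, reverse=highest)
    return ranked[:k]


if __name__ == "__main__":
    pool = ["good good good good", "the plot was slow but the acting was superb"]
    print(select_by_entropy(pool, fraction=0.5))
```

The selection needs no labels and no model forward pass, which is the kind of low-compute, unsupervised setting the contribution describes; a model-based entropy estimate could be swapped in for `token_entropy` without changing the ranking logic.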
Findings
EUDS reduces computational costs significantly.
EUDS improves training efficiency with less data.
EUDS's effectiveness is validated across sentiment, topic, and Q&A tasks.
Abstract
Modern language models (LMs) increasingly require two critical resources: computational resources and data resources. Data selection techniques can effectively reduce the amount of training data required for fine-tuning LMs. However, their effectiveness is closely tied to computational resources, and they typically require a high compute budget. Given the resource limitations of practical fine-tuning scenarios, we systematically reveal the relationship between data selection and uncertainty estimation of the selected data. Although large language models (LLMs) exhibit exceptional capabilities in language understanding and generation, which provide new ways to alleviate data scarcity, evaluating data usability remains a challenging task. This makes efficient data selection indispensable. To mitigate these issues, we propose the Entropy-Based Unsupervised Data Selection (EUDS) framework. Empirical…
Taxonomy
Topics: Sentiment Analysis and Opinion Mining · Topic Modeling · Computational and Text Analysis Methods
