EvoSelect: Data-Efficient LLM Evolution for Targeted Task Adaptation

Ting-Wei Li; Sirui Chen; Jiaru Zou; Yingbing Huang; Tianxin Wei; Jingrui He; Hanghang Tong

arXiv:2604.26170 · cs.CL · April 30, 2026

TL;DR

EvoSelect is a data-efficient framework that iteratively adapts a large language model to a targeted task by selecting diverse, task-aligned synthetic data, using an optimal-transport-based measure of task relevance.

Contribution

EvoSelect inserts a selection step into the data generation-training loop. The step combines task-relevance estimation with diversity promotion to improve LLM adaptation.
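To make the idea of "relevance plus diversity" concrete, here is a minimal sketch of one way such a selection step could look. It is not the paper's algorithm: the function names (`sinkhorn_plan`, `select_diverse_relevant`), the cosine cost, the uniform marginals, the per-candidate transport-cost score, and the greedy redundancy penalty `lam` are all assumptions made for illustration. Candidates and target-task examples are assumed to be given as embedding matrices.

```python
import numpy as np


def sinkhorn_plan(cost, reg=0.1, n_iters=200):
    """Entropic-regularized OT plan between uniform marginals (Sinkhorn iterations)."""
    n, m = cost.shape
    a = np.full(n, 1.0 / n)  # uniform mass on candidates (assumption)
    b = np.full(m, 1.0 / m)  # uniform mass on target examples (assumption)
    kernel = np.exp(-cost / reg)
    u = np.ones(n)
    for _ in range(n_iters):
        v = b / (kernel.T @ u)
        u = a / (kernel @ v)
    return u[:, None] * kernel * v[None, :]


def select_diverse_relevant(cand, target, k, reg=0.1, lam=0.5):
    """Greedily pick k candidate indices, trading off relevance and redundancy.

    Hypothetical scoring: relevance is the negated transport cost a candidate
    pays under the OT plan; redundancy is its highest cosine similarity
    (clipped at zero) to anything already picked.
    """
    cand = cand / np.linalg.norm(cand, axis=1, keepdims=True)
    target = target / np.linalg.norm(target, axis=1, keepdims=True)
    cost = 1.0 - cand @ target.T  # cosine distance
    plan = sinkhorn_plan(cost, reg)
    # Each row of the plan carries mass 1/n, so rescale to a per-candidate average cost.
    relevance = -(plan * cost).sum(axis=1) * len(cand)
    sim = cand @ cand.T
    available = np.ones(len(cand), dtype=bool)
    max_sim = np.zeros(len(cand))
    chosen = []
    for _ in range(min(k, len(cand))):
        score = np.where(available, relevance - lam * max_sim, -np.inf)
        i = int(np.argmax(score))
        chosen.append(i)
        available[i] = False
        max_sim = np.maximum(max_sim, sim[i])
    return chosen
```

In this sketch a near-duplicate of an already selected sample is penalized through `lam`, and a sample far from every target example pays a high transport cost, so both tend to be passed over when the budget `k` is small.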

Findings

01. EvoSelect outperforms existing data selection methods across multiple benchmarks.

02. It effectively leverages both weak and strong data generators for targeted task adaptation.

03. The framework enhances model performance with less data and fewer iterations.

Abstract

Adapting large language models (LLMs) to a targeted task efficiently and effectively remains a fundamental challenge. Such adaptation often requires iteratively improving the model toward the targeted task, yet collecting high-quality human-labeled data to support this process is costly and difficult to scale. As a result, synthetic data generation has emerged as a flexible and scalable alternative. One straightforward approach is an iterative generation-training loop: candidate data are synthesized by an external generator, the model is updated on these data, and the process is repeated over several iterations. However, generated samples can be noisy, highly redundant, or even misaligned with the targeted task distribution. Training indiscriminately on such data can dilute useful learning signals and even degrade model performance. To address this, we introduce a refined…
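The generation-training loop described in the abstract, with a selection step placed between generation and training, can be sketched as a small driver function. This is an illustrative skeleton under stated assumptions, not the authors' implementation: `evolve` and its parameters (`generate`, `select`, `train`, `n_rounds`, `pool_size`, `budget`) are hypothetical names, and the three callables stand in for an external generator, a data selector, and a fine-tuning routine.

```python
from typing import Callable, List, Tuple, TypeVar

Model = TypeVar("Model")
Sample = TypeVar("Sample")


def evolve(
    model: Model,
    generate: Callable[[Model, int], List[Sample]],
    select: Callable[[Model, List[Sample], int], List[Sample]],
    train: Callable[[Model, List[Sample]], Model],
    n_rounds: int,
    pool_size: int,
    budget: int,
) -> Tuple[Model, List[int]]:
    """Run a generate -> select -> train loop for n_rounds iterations.

    Returns the final model and the number of samples trained on per round.
    """
    kept_per_round: List[int] = []
    for _ in range(n_rounds):
        pool = generate(model, pool_size)   # synthesize candidate data
        kept = select(model, pool, budget)  # drop noisy, redundant, or off-task samples
        model = train(model, kept)          # update the model on the kept subset only
        kept_per_round.append(len(kept))
    return model, kept_per_round
```

Passing the three stages in as callables keeps the loop independent of any particular generator or selector, so a weak or a strong generator can be swapped in without changing the driver.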
