TSDS: Data Selection for Task-Specific Model Finetuning
Zifan Liu, Amin Karbasi, Theodoros Rekatsinas

TL;DR
TSDS is a data selection framework for task-specific model finetuning. It selects representative and diverse data using an optimal-transport alignment loss, a diversity regularizer, and efficient nearest neighbor search, and the selected data often outperforms both the full dataset and baseline selection methods.
Contribution
Introduces TSDS, a novel data selection method for finetuning that uses optimal transport, diversity regularization, and efficient nearest neighbor algorithms.
Findings
Instruction tuning on data selected by TSDS often outperforms tuning on the full dataset.
TSDS outperforms baseline selection methods by 1.5 points in F1 score.
A selection ratio of just 1% is enough to reach high task performance.
Abstract
Finetuning foundation models for specific tasks is an emerging paradigm in modern machine learning. The efficacy of task-specific finetuning largely depends on the selection of appropriate training data. We present TSDS (Task-Specific Data Selection), a framework to select data for task-specific model finetuning, guided by a small but representative set of examples from the target task. To do so, we formulate data selection for task-specific finetuning as an optimization problem with a distribution alignment loss based on optimal transport to capture the discrepancy between the selected data and the target distribution. In addition, we add a regularizer to encourage the diversity of the selected data and incorporate kernel density estimation into the regularizer to reduce the negative effects of near-duplicates among the candidate data. We connect our optimization problem to nearest…
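The link between distribution alignment, diversity, and nearest neighbors can be illustrated with a small sketch. This is not the authors' algorithm: it is a simplified stand-in that assumes fixed embedding vectors, squared Euclidean cost, a hypothetical function name `select_probabilities`, and hypothetical parameters `k` and `bandwidth`. Each target example spreads its mass over its `k` nearest candidates (a crude proxy for the diversity regularizer), and a Gaussian kernel density estimate down-weights candidates that sit in dense clusters of near-duplicates.

```python
import numpy as np

def select_probabilities(candidates, queries, k=3, bandwidth=1.0):
    """Illustrative sketch only (assumed names and parameters, not the paper's method).

    candidates: (n, d) array of candidate embeddings.
    queries:    (m, d) array of target-task example embeddings.
    Returns a length-n probability vector over candidates.
    """
    # Pairwise squared Euclidean distances between queries and candidates: (m, n).
    d2 = ((queries[:, None, :] - candidates[None, :, :]) ** 2).sum(axis=-1)
    m, n = d2.shape

    # Indices of the k nearest candidates for each query: (m, k).
    nearest = np.argsort(d2, axis=1)[:, :k]

    # Unnormalized Gaussian kernel density of each candidate among all candidates.
    cc = ((candidates[:, None, :] - candidates[None, :, :]) ** 2).sum(axis=-1)
    density = np.exp(-cc / (2.0 * bandwidth ** 2)).sum(axis=1)

    probs = np.zeros(n)
    for i in range(m):
        # Inverse-density weights: near-duplicates share mass instead of multiplying it.
        w = 1.0 / density[nearest[i]]
        w /= w.sum()
        # Each query carries total mass 1/m, so the result sums to 1.
        probs[nearest[i]] += w / m
    return probs

# Toy usage: three exact duplicates, one distinct neighbor, one far-away point.
candidates = np.array([[0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [10.0, 10.0]])
queries = np.array([[0.5, 0.0]])
probs = select_probabilities(candidates, queries, k=4)

# Draw a small training subset according to the selection probabilities.
rng = np.random.default_rng(0)
chosen = rng.choice(len(candidates), size=3, p=probs)
```

In the toy example the distinct neighbor receives more mass than any single duplicate, and the far-away candidate receives none, which is the qualitative behavior the abstract describes for the regularizer with kernel density estimation.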
Taxonomy
Topics: Business Process Modeling and Analysis · Simulation Techniques and Applications · Model-Driven Software Engineering Techniques
Methods: Sparse Evolutionary Training
