From Instance Selection to Fixed-Pool Data Recipe Search for Supervised Fine-Tuning

Haodong Wu; Jiahao Zhang; Lijie Hu; Yongqi Zhang

arXiv:2605.12944·cs.LG·May 14, 2026

TL;DR

This paper introduces AutoSelection, a method for fixed-pool data recipe search in supervised fine-tuning. Instead of ranking individual examples, it searches for an ordered recipe of curation operators over a fixed instruction pool, which improves model reasoning performance.

Contribution

The paper formulates data selection as fixed-pool recipe search and proposes AutoSelection, a two-layer solver that discovers high-quality data subsets under a limited budget of full SFT evaluations.

Findings

01. AutoSelection outperforms full-data training and other baselines in in-distribution reasoning.

02. Recipe structure significantly impacts model performance beyond individual operators.

03. AutoSelection achieves strong results on a 90K instruction pool across multiple models.

Abstract

Supervised fine-tuning (SFT) data selection is commonly formulated as instance ranking: score each example and retain a top-$k$ subset. However, effective SFT training subsets are often produced through ordered curation recipes, where filtering, mixing, and deduplication operators jointly shape the final data distribution. We formulate this problem as fixed-pool data recipe search: given a raw instruction pool and a library of grounded operators, the goal is to discover an executable recipe that constructs a high-quality selected subset under a limited budget of full SFT evaluations, without generating, rewriting, or augmenting training samples. We introduce AutoSelection, a two-layer solver that decouples fixed-pool materialization based on cached task-, data-, and model-side signals from expensive full evaluation, using warmup probes, realized subset states, local recipe edits,…
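
The contrast the abstract draws between instance ranking and an ordered recipe can be illustrated with a minimal Python sketch. Everything here is a hypothetical stand-in, not the paper's operator library or the AutoSelection solver: the toy pool, the `score` field, and the operators `filter_min_score`, `dedup_by_text`, and `cap_per_source` are assumptions made for illustration. The sketch only shows that a recipe selects from a fixed pool without creating or rewriting samples, and that the order of operators changes the resulting subset.

```python
from typing import Any, Callable, Dict, List

Example = Dict[str, Any]
Operator = Callable[[List[Example]], List[Example]]


def top_k(pool: List[Example], k: int) -> List[Example]:
    """Instance ranking baseline: keep the k highest-scoring examples."""
    return sorted(pool, key=lambda ex: ex["score"], reverse=True)[:k]


def filter_min_score(threshold: float) -> Operator:
    """Hypothetical filtering operator: drop examples below a score threshold."""
    def op(pool: List[Example]) -> List[Example]:
        return [ex for ex in pool if ex["score"] >= threshold]
    return op


def dedup_by_text() -> Operator:
    """Hypothetical deduplication operator: keep the first copy of each normalized text."""
    def op(pool: List[Example]) -> List[Example]:
        seen, kept = set(), []
        for ex in pool:
            key = ex["text"].strip().lower()
            if key not in seen:
                seen.add(key)
                kept.append(ex)
        return kept
    return op


def cap_per_source(limit: int) -> Operator:
    """Hypothetical mixing operator: keep at most `limit` examples per source."""
    def op(pool: List[Example]) -> List[Example]:
        counts: Dict[str, int] = {}
        kept = []
        for ex in pool:
            src = ex["source"]
            if counts.get(src, 0) < limit:
                counts[src] = counts.get(src, 0) + 1
                kept.append(ex)
        return kept
    return op


def run_recipe(pool: List[Example], recipe: List[Operator]) -> List[Example]:
    """Apply operators in order; each one only selects from the fixed pool."""
    subset = list(pool)
    for op in recipe:
        subset = op(subset)
    return subset


# Toy fixed pool (made-up data for illustration only).
POOL = [
    {"id": 0, "text": "Solve 2+2", "source": "math", "score": 0.9},
    {"id": 1, "text": "solve 2+2 ", "source": "math", "score": 0.8},
    {"id": 2, "text": "Prove x>0", "source": "math", "score": 0.7},
    {"id": 3, "text": "Write a haiku", "source": "chat", "score": 0.6},
    {"id": 4, "text": "Say hi", "source": "chat", "score": 0.2},
]

ranked_ids = [ex["id"] for ex in top_k(POOL, 3)]

# Same operators, different order.
recipe_a = [filter_min_score(0.5), dedup_by_text(), cap_per_source(2)]
recipe_b = [filter_min_score(0.5), cap_per_source(2), dedup_by_text()]
recipe_a_ids = [ex["id"] for ex in run_recipe(POOL, recipe_a)]
recipe_b_ids = [ex["id"] for ex in run_recipe(POOL, recipe_b)]

print(ranked_ids, recipe_a_ids, recipe_b_ids)
```

In this toy pool, plain top-3 ranking keeps a near-duplicate pair, while the two recipes built from the same operators return different subsets depending on whether deduplication runs before or after the per-source cap. Searching over such recipes under a small evaluation budget is the problem the paper formulates; the search procedure itself is not sketched here.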

Code & Models

Repositories

w253/AutoSelection
github
