TL;DR
This paper introduces AutoSelection, a method for fixed-pool data recipe search in supervised fine-tuning (SFT). Instead of ranking individual examples, it searches for an ordered curation recipe that constructs the training subset, which improves model reasoning performance.
Contribution
The paper formulates data selection as fixed-pool recipe search and proposes AutoSelection, a two-layer solver that discovers high-quality data subsets under a limited budget of full SFT evaluations.
Findings
AutoSelection outperforms full-data training and other baselines on in-distribution reasoning.
Recipe structure significantly impacts model performance beyond individual operators.
AutoSelection achieves strong results on a 90K instruction pool across multiple models.
Abstract
Supervised fine-tuning (SFT) data selection is commonly formulated as instance ranking: score each example and retain a top-k subset. However, effective SFT training subsets are often produced through ordered curation recipes, where filtering, mixing, and deduplication operators jointly shape the final data distribution. We formulate this problem as fixed-pool data recipe search: given a raw instruction pool and a library of grounded operators, the goal is to discover an executable recipe that constructs a high-quality selected subset under a limited budget of full SFT evaluations, without generating, rewriting, or augmenting training samples. We introduce AutoSelection, a two-layer solver that decouples fixed-pool materialization based on cached task-, data-, and model-side signals from expensive full evaluation, using warmup probes, realized subset states, local recipe edits,…
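To make the contrast between instance ranking and an ordered recipe concrete, here is a minimal sketch. It is not the paper's implementation: the operator names (`filter_min_score`, `dedup_by_text`, `cap_per_source`), the toy pool, and the `score` field are all hypothetical, and `cap_per_source` is only a crude stand-in for a mixing operator. The sketch shows a recipe as an ordered list of operators applied to a fixed pool, with no sample generated or rewritten, and shows that reordering the same operators can change the selected subset.

```python
from typing import Any, Callable, Dict, List

Example = Dict[str, Any]
Operator = Callable[[List[Example]], List[Example]]


def top_k(pool: List[Example], k: int) -> List[Example]:
    """Instance-ranking baseline: keep the k highest-scoring examples."""
    return sorted(pool, key=lambda ex: ex["score"], reverse=True)[:k]


def filter_min_score(threshold: float) -> Operator:
    """Hypothetical filtering operator: drop examples below a score threshold."""
    def op(pool: List[Example]) -> List[Example]:
        return [ex for ex in pool if ex["score"] >= threshold]
    return op


def dedup_by_text() -> Operator:
    """Hypothetical deduplication operator: keep the first copy of each normalized text."""
    def op(pool: List[Example]) -> List[Example]:
        seen, kept = set(), []
        for ex in pool:
            key = ex["text"].strip().lower()
            if key not in seen:
                seen.add(key)
                kept.append(ex)
        return kept
    return op


def cap_per_source(limit: int) -> Operator:
    """Hypothetical mixing stand-in: keep at most `limit` examples per source."""
    def op(pool: List[Example]) -> List[Example]:
        counts: Dict[str, int] = {}
        kept = []
        for ex in pool:
            src = ex["source"]
            if counts.get(src, 0) < limit:
                counts[src] = counts.get(src, 0) + 1
                kept.append(ex)
        return kept
    return op


def run_recipe(pool: List[Example], recipe: List[Operator]) -> List[Example]:
    """Apply operators in order; the pool is only subsetted, never rewritten."""
    subset = list(pool)
    for op in recipe:
        subset = op(subset)
    return subset


# Toy fixed pool (assumed data, for illustration only).
POOL = [
    {"id": 0, "text": "Solve 2x = 4", "source": "math", "score": 0.9},
    {"id": 1, "text": "solve 2x = 4 ", "source": "math", "score": 0.8},
    {"id": 2, "text": "Prove sqrt(2) is irrational", "source": "math", "score": 0.7},
    {"id": 3, "text": "Reverse a linked list", "source": "code", "score": 0.6},
    {"id": 4, "text": "Hi", "source": "chat", "score": 0.1},
]

recipe_a = [filter_min_score(0.5), dedup_by_text(), cap_per_source(2)]
recipe_b = [filter_min_score(0.5), cap_per_source(2), dedup_by_text()]

ranked_ids = [ex["id"] for ex in top_k(POOL, 2)]
recipe_a_ids = [ex["id"] for ex in run_recipe(POOL, recipe_a)]
recipe_b_ids = [ex["id"] for ex in run_recipe(POOL, recipe_b)]

print(ranked_ids)    # [0, 1]  (top-2 keeps two near-duplicates)
print(recipe_a_ids)  # [0, 2, 3]
print(recipe_b_ids)  # [0, 3]  (same operators, different order)
```

In this toy pool, top-2 ranking retains a near-duplicate pair, while the two recipes differ only in operator order and still yield different subsets. That is the sense in which the search space is over recipes rather than over per-example scores.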
