Let the Target Select for Itself: Data Selection via Target-Aligned Paths
Huitao Yang, Hengzhi He, Guang Cheng

TL;DR
This paper introduces a data selection method for targeted training that scores candidates along a validation-induced flow, cutting warmup and storage costs relative to existing attribution-based approaches and making the reference path reusable.
Contribution
It proposes a target-aligned reference path based on validation warmup, enabling efficient and reusable data selection without candidate gradients or Hessian computations.
Findings
The method performs competitively with strong baselines.
It significantly reduces warmup and storage costs.
The approach is effective across logistic, vision, and instruction-tuning tasks.
Abstract
Targeted data selection aims to identify training samples from a large candidate pool that improve performance on a specific downstream task. Many recent methods estimate candidate utility by aggregating local attribution scores along a trajectory induced by the candidate pool. When the pool is heterogeneous, however, this reference trajectory may be misaligned with the dynamics of a target-aligned selected subset, creating what we call reference path bias. We propose an alternative reference path: a validation-induced flow obtained from a short, capacity-limited warmup on the available target validation proxy. Along this path, candidates are scored by a normalized endpoint loss drop, yielding a simple zero-order selection rule that requires no candidate gradients or Hessian approximations. Across controlled logistic, vision, and instruction-tuning experiments, this score is competitive…
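The scoring rule described in the abstract can be illustrated with a minimal sketch on a toy logistic model. This is not the authors' implementation. The abstract does not give the exact normalization or explain how the warmup is capacity-limited, so the sketch makes two assumptions: the score is the relative loss drop between the start and end of the path, and capacity is limited by running only a few gradient steps. All function names, hyperparameters, and the synthetic data are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def per_example_loss(w, X, y):
    """Logistic loss for each row of X, with labels y in {0, 1}."""
    p = np.clip(sigmoid(X @ w), 1e-12, 1 - 1e-12)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def validation_warmup(w0, X_val, y_val, steps=20, lr=0.5):
    """Short gradient-descent warmup on the validation proxy only.
    Assumption: 'capacity-limited' is approximated by using few steps."""
    w = w0.copy()
    for _ in range(steps):
        grad = X_val.T @ (sigmoid(X_val @ w) - y_val) / len(y_val)
        w -= lr * grad
    return w

def endpoint_loss_drop(w0, wT, X_cand, y_cand, eps=1e-8):
    """Zero-order score: only forward passes at the two path endpoints.
    Assumption: normalization divides the drop by the starting loss."""
    l0 = per_example_loss(w0, X_cand, y_cand)
    lT = per_example_loss(wT, X_cand, y_cand)
    return (l0 - lT) / (l0 + eps)

# Synthetic, heterogeneous candidate pool (illustrative data only).
rng = np.random.default_rng(0)
d = 5
w_true = rng.normal(size=d)
X_val = rng.normal(size=(64, d))
y_val = (X_val @ w_true > 0).astype(float)
X_cand = rng.normal(size=(200, d))
y_cand = (X_cand @ w_true > 0).astype(float)
y_cand[100:] = 1 - y_cand[100:]  # second half: labels misaligned with the target

w0 = np.zeros(d)
wT = validation_warmup(w0, X_val, y_val)        # path depends on validation data only
scores = endpoint_loss_drop(w0, wT, X_cand, y_cand)
top_k = np.argsort(-scores)[:50]                # keep the 50 highest-scoring candidates
```

In this sketch the warmup never touches the candidate pool, so the same pair of endpoints `(w0, wT)` could be reused to score a different pool, and no candidate gradients or Hessian approximations are computed.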
