A Critical Look at Targeted Instruction Selection: Disentangling What Matters (and What Doesn't)
Nihal V. Nayak, Paula Rodriguez-Diaz, Neha Hulkund, Sara Beery, David Alvarez-Melis

TL;DR
This paper systematically analyzes instruction selection methods for fine-tuning large language models. It finds that gradient-based data representations paired with greedy selection algorithms perform well at low budgets, and it provides a unified theoretical framework for selection algorithms.
Contribution
The paper disentangles data representation from selection algorithms, offers a controlled comparison framework, and unifies existing methods as approximate distance minimization.
Findings
Gradient-based representations predict performance across datasets.
Greedy selection with gradient-based representations works best at low budgets.
Existing selection algorithms can be viewed as forms of approximate distance minimization.
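To make the "greedy selection as approximate distance minimization" idea concrete, here is a minimal sketch. It is not the paper's algorithm: the function name `greedy_select`, the cosine distance, and the objective (mean distance from each query to its nearest selected candidate) are all assumptions chosen for illustration. The representation matrices stand in for whatever per-example vectors (for instance gradient-based features) a practitioner has already computed.

```python
import numpy as np


def greedy_select(candidates: np.ndarray, queries: np.ndarray, budget: int) -> list[int]:
    """Hypothetical greedy selector (illustrative only, not the paper's method).

    At each step, add the candidate that most reduces the mean cosine
    distance from every query to its nearest already-selected candidate.
    Assumes every row has non-zero norm.
    """
    # Normalize rows so that a dot product equals cosine similarity.
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    dist = 1.0 - q @ c.T  # shape: (num_queries, num_candidates)

    # best[i] is the distance from query i to its nearest selected candidate.
    best = np.full(q.shape[0], np.inf)
    available = np.ones(c.shape[0], dtype=bool)
    selected: list[int] = []

    for _ in range(min(budget, c.shape[0])):
        # Objective value if candidate j were added to the current subset.
        objective = np.minimum(best[:, None], dist).mean(axis=0)
        objective[~available] = np.inf  # never pick the same candidate twice
        j = int(np.argmin(objective))
        selected.append(j)
        available[j] = False
        best = np.minimum(best, dist[:, j])

    return selected
```

With a budget of `k`, the function returns `k` distinct candidate indices in the order they were picked. Swapping in a different distance or objective changes which subset is chosen, which is the kind of design choice the paper's framework is meant to isolate.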
Abstract
Instruction fine-tuning of large language models (LLMs) often involves selecting a subset of instruction training data from a large candidate pool, using a small query set from the target task. Despite growing interest, the literature on targeted instruction selection remains fragmented and opaque: methods vary widely in selection budgets, often omit zero-shot baselines, and frequently entangle the contributions of key components. As a result, practitioners lack actionable guidance on selecting instructions for their target tasks. In this work, we aim to bring clarity to this landscape by disentangling and systematically analyzing the two core ingredients: data representation and selection algorithms. Our framework enables controlled comparisons across models, tasks, and budgets. We find that only gradient-based data representations choose subsets whose similarity to the query…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
