Uncertainty-Aware Gradient Signal-to-Noise Data Selection for Instruction Tuning
Zhihang Yuan, Chengyu Yue, Long Huang, Litu Ou, Lei Shi

TL;DR
This paper introduces GRADFILTERING, an uncertainty-aware data selection method for instruction tuning of large language models. It scores training examples by a Gradient Signal-to-Noise Ratio (G-SNR) computed on a small GPT-2 proxy, with the aim of cutting fine-tuning cost without sacrificing quality.
Contribution
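The paper presents an objective-agnostic data selection framework that uses a small GPT-2 proxy with a LoRA ensemble and aggregates per-example gradients into a G-SNR utility. Subsets chosen this way match or surpass random subsets and strong baselines in most evaluations, and converge faster than competing filters under the same compute.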
Findings
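GRADFILTERING matches or surpasses random subsets and strong baselines in most LLM-as-a-judge evaluations, as well as in human assessment.
GRADFILTERING-selected subsets converge faster than competitive filters under the same compute.
The scoring accounts for evolving uncertainty, which the authors describe as a key source of LLM interpretability that earlier selection methods largely ignore.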
Abstract
Instruction tuning is a standard paradigm for adapting large language models (LLMs), but modern instruction datasets are large, noisy, and redundant, making full-data fine-tuning costly and often unnecessary. Existing data selection methods either build expensive gradient datastores or assign static scores from a weak proxy, largely ignoring evolving uncertainty, and thus missing a key source of LLM interpretability. We propose GRADFILTERING, an objective-agnostic, uncertainty-aware data selection framework that utilizes a small GPT-2 proxy with a LoRA ensemble and aggregates per-example gradients into a Gradient Signal-to-Noise Ratio (G-SNR) utility. Our method matches or surpasses random subsets and strong baselines in most LLM-as-a-judge evaluations as well as in human assessment. Moreover, GRADFILTERING-selected subsets converge faster than competitive filters under the same compute…
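The abstract does not give the G-SNR formula, so the following is only a minimal sketch of the general idea, assuming each example is summarized by one scalar gradient statistic per ensemble member and that the score is the absolute mean across members divided by the standard deviation. The function names `gsnr_scores` and `select_top_fraction`, the `eps` stabilizer, and the mean-over-std form are all illustrative assumptions, not the authors' definition.

```python
from statistics import mean, pstdev


def gsnr_scores(grads, eps=1e-8):
    """Score examples by an assumed signal-to-noise ratio.

    grads[k][i] is a scalar gradient summary (hypothetical, e.g. a
    projected gradient) for example i under ensemble member k.
    """
    n_members = len(grads)
    n_examples = len(grads[0])
    scores = []
    for i in range(n_examples):
        vals = [grads[k][i] for k in range(n_members)]
        signal = abs(mean(vals))   # agreement across ensemble members
        noise = pstdev(vals)       # disagreement (uncertainty) across members
        scores.append(signal / (noise + eps))
    return scores


def select_top_fraction(scores, fraction):
    """Return the indices of the highest-scoring examples, in dataset order."""
    k = max(1, int(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Toy input: 3 ensemble members, 3 examples (made-up numbers).
grads = [
    [1.0, 0.5, 2.0],
    [1.0, 1.5, 2.1],
    [1.0, -1.0, 1.9],
]
scores = gsnr_scores(grads)
kept = select_top_fraction(scores, 0.67)
```

In this toy input the second example has gradients that disagree in sign across members, so it receives the lowest score and is dropped. The actual method operates on per-example gradients from LoRA adapters on the GPT-2 proxy rather than on hand-written scalars.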
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
