Uncertainty-Aware Gradient Signal-to-Noise Data Selection for Instruction Tuning

Zhihang Yuan; Chengyu Yue; Long Huang; Litu Ou; Lei Shi

arXiv:2601.13697 · cs.CL · January 21, 2026

TL;DR

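This paper introduces GRADFILTERING, an uncertainty-aware data selection method for instruction tuning of large language models. It scores training examples with a Gradient Signal-to-Noise Ratio (G-SNR) computed from a small GPT-2 proxy with a LoRA ensemble, so that fine-tuning can use a selected subset instead of the full, noisy dataset.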

Contribution

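The paper presents an objective-agnostic data selection framework that uses a small GPT-2 proxy with a LoRA ensemble and aggregates per-example gradients into a Gradient Signal-to-Noise Ratio (G-SNR) utility. The selected subsets match or surpass random subsets and strong baselines in most evaluations, and converge faster than competitive filters under the same compute.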

Findings

01

GRADFILTERING matches or surpasses random subsets and strong baselines in most LLM-as-a-judge evaluations, as well as in human assessment.

02

GRADFILTERING-selected subsets converge faster than competitive filters under the same compute.

03

The scoring accounts for evolving uncertainty, which the authors describe as a key source of LLM interpretability that static proxy scores largely ignore.

Abstract

Instruction tuning is a standard paradigm for adapting large language models (LLMs), but modern instruction datasets are large, noisy, and redundant, making full-data fine-tuning costly and often unnecessary. Existing data selection methods either build expensive gradient datastores or assign static scores from a weak proxy, largely ignoring evolving uncertainty, and thus missing a key source of LLM interpretability. We propose GRADFILTERING, an objective-agnostic, uncertainty-aware data selection framework that utilizes a small GPT-2 proxy with a LoRA ensemble and aggregates per-example gradients into a Gradient Signal-to-Noise Ratio (G-SNR) utility. Our method matches or surpasses random subsets and strong baselines in most LLM-as-a-judge evaluations as well as in human assessment. Moreover, GRADFILTERING-selected subsets converge faster than competitive filters under the same compute…
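The abstract does not spell out how G-SNR is computed, so the following is only a minimal sketch of one common signal-to-noise formulation, not the authors' definition. It assumes that each ensemble member yields one gradient vector per example, that "signal" is the squared norm of the mean gradient across members, and that "noise" is the summed per-coordinate variance across members. The function names `gsnr` and `select_top_fraction`, the `eps` stabilizer, and the keep-the-highest-scores rule are all hypothetical choices made for illustration.

```python
from statistics import fmean, pvariance


def gsnr(member_grads, eps=1e-12):
    """Illustrative signal-to-noise score for ONE training example.

    member_grads: list of K gradient vectors (lists of floats), one per
    ensemble member. This is an assumed formulation, not the paper's.
    """
    if len(member_grads) < 2:
        raise ValueError("need at least two ensemble members to estimate noise")
    if len({len(g) for g in member_grads}) != 1:
        raise ValueError("all gradient vectors must have the same length")

    # Regroup so each tuple holds one coordinate's values across members.
    per_coordinate = list(zip(*member_grads))

    # Signal: squared norm of the mean gradient across members.
    signal = sum(fmean(values) ** 2 for values in per_coordinate)
    # Noise: total population variance across members.
    noise = sum(pvariance(values) for values in per_coordinate)

    return signal / (noise + eps)


def select_top_fraction(scores, fraction):
    """Return the indices of the highest-scoring examples (assumed rule)."""
    keep = max(1, int(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


if __name__ == "__main__":
    # Two members that agree on direction give a high score;
    # two members that cancel out give a score of zero.
    agreeing = [[1.0, 1.0], [3.0, 1.0]]
    cancelling = [[1.0, 0.0], [-1.0, 0.0]]
    scores = [gsnr(agreeing), gsnr(cancelling)]
    print(scores, select_top_fraction(scores, 0.5))
```

Under this assumed formulation, an example scores highly when the ensemble members agree on a gradient direction and scores near zero when their gradients disagree, which is one way to read "uncertainty-aware". In practice the gradients would come from the LoRA parameters of the GPT-2 proxy rather than from hand-written lists.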

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications