Importance-Aware Data Selection for Efficient LLM Instruction Tuning

Tingyu Jiang; Shen Li; Yiyao Song; Lan Zhang; Hualei Zhu; Yuan Zhao; Xiaohang Xu; Kenjiro Taura; Hao Henry Wang

arXiv:2511.07074 · cs.CL · November 11, 2025

TL;DR

This paper introduces the Model Instruction Weakness Value (MIWV), a metric for selecting high-impact instruction data for LLM tuning, and reports that training on only the top 1% of data ranked by MIWV can outperform training on the full dataset.

Contribution

The paper proposes MIWV, a novel importance-aware metric for data selection in instruction tuning, improving efficiency and performance of LLMs.

Findings

01. Selecting the top 1% of the data by MIWV outperforms training on the full dataset.

02. MIWV effectively identifies the most beneficial instruction data.

03. Empirical results validate the superiority of MIWV-based data selection.
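
The selection step described in the findings can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each instruction sample already has a precomputed importance score (MIWV in the paper, whose formula is not reproduced on this page), and the function name `select_top_fraction` and its parameters are hypothetical.

```python
import math
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def select_top_fraction(
    samples: Sequence[T],
    scores: Sequence[float],
    fraction: float = 0.01,
) -> List[T]:
    """Return the highest-scoring `fraction` of `samples`.

    `scores` is assumed to hold one precomputed importance value per
    sample (for example, an MIWV-style score); how that score is
    computed is outside the scope of this sketch.
    """
    if len(samples) != len(scores):
        raise ValueError("samples and scores must have the same length")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    if not samples:
        return []

    # Keep at least one sample, rounding the subset size up.
    k = max(1, math.ceil(len(samples) * fraction))

    # Rank indices by score, highest first; ties keep their original order.
    ranked = sorted(range(len(samples)), key=lambda i: scores[i], reverse=True)
    return [samples[i] for i in ranked[:k]]


# Toy usage with made-up scores (not results from the paper).
toy_samples = ["a", "b", "c", "d"]
toy_scores = [0.2, 0.9, 0.1, 0.7]
subset = select_top_fraction(toy_samples, toy_scores, fraction=0.5)
```

With the default `fraction=0.01`, the function keeps the top 1% of samples; the selected subset would then be used as the instruction-tuning set in place of the full dataset.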

Abstract

Instruction tuning plays a critical role in enhancing the performance and efficiency of Large Language Models (LLMs). Its success depends not only on the quality of the instruction data but also on the inherent capabilities of the LLM itself. Some studies suggest that even a small amount of high-quality data can achieve instruction fine-tuning results that are on par with, or even exceed, those from using a full-scale dataset. However, rather than focusing solely on calculating data quality scores to evaluate instruction data, there is a growing need to select high-quality data that maximally enhances the performance of instruction tuning for a given LLM. In this paper, we propose the Model Instruction Weakness Value (MIWV) as a novel metric to quantify the importance of instruction data in enhancing the model's capabilities. The MIWV metric is derived from the discrepancies in the model's…

Taxonomy

Topics: Topic Modeling · Text Readability and Simplification · Online Learning and Analytics