RICo: Refined In-Context Contribution for Automatic Instruction-Tuning Data Selection

Yixin Yang; Qingxiu Dong; Linli Yao; Fangwei Zhu; Zhifang Sui

arXiv:2505.05327·cs.CL·May 20, 2025

TL;DR

RICo is a gradient-free data selection method that accurately identifies high-contribution samples for instruction tuning, significantly improving LLM performance with less data and lower costs.

Contribution

Introduces RICo, a novel gradient-free contribution measurement method for efficient data selection in instruction tuning of large language models.

Findings

01

Models trained on RICo-selected data outperform full datasets.

02

RICo-selected samples cover diverse tasks and appropriate difficulty levels.

03

Performance gains on three LLMs across 12 benchmarks and 5 pairwise evaluation sets.

Abstract

Data selection for instruction tuning is crucial for improving the performance of large language models (LLMs) while reducing training costs. In this paper, we propose Refined Contribution Measurement with In-Context Learning (RICo), a novel gradient-free method that quantifies the fine-grained contribution of individual samples to both task-level and global-level model performance. RICo enables more accurate identification of high-contribution data, leading to better instruction tuning. We further introduce a lightweight selection paradigm trained on RICo scores, enabling scalable data selection with strictly linear inference complexity. Extensive experiments on three LLMs across 12 benchmarks and 5 pairwise evaluation sets demonstrate the effectiveness of RICo. Remarkably, on LLaMA3.1-8B, models trained on 15% of RICo-selected data outperform those trained on the full dataset by 5.42 percentage points and exceed…
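As a rough illustration of the selection step the abstract describes (score every sample once, then keep a small top-scoring fraction), the sketch below may help. It is not the authors' implementation: `select_top_fraction` and `score_fn` are hypothetical names, and `score_fn` merely stands in for whatever lightweight scorer has been trained on RICo scores. The only property it mirrors is that each sample is scored exactly once, so scoring cost grows linearly with dataset size.

```python
import heapq
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def select_top_fraction(
    samples: Sequence[T],
    score_fn: Callable[[T], float],
    fraction: float = 0.15,
) -> List[T]:
    """Keep the highest-scoring `fraction` of `samples`, in original order.

    `score_fn` is a hypothetical placeholder for a learned contribution
    scorer; it is called exactly once per sample.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    if not samples:
        return []

    # One scoring call per sample: linear in the number of samples.
    scored = [(score_fn(sample), index) for index, sample in enumerate(samples)]

    # Keep at least one sample even for very small datasets.
    k = max(1, round(len(samples) * fraction))

    # Pick the k largest scores, then restore the original dataset order.
    top = heapq.nlargest(k, scored, key=lambda pair: pair[0])
    keep = sorted(index for _, index in top)
    return [samples[index] for index in keep]


# Toy usage: string length stands in for a contribution score.
toy_samples = ["a", "bbbb", "cc", "ddd"]
toy_selected = select_top_fraction(toy_samples, score_fn=len, fraction=0.5)
```

With the toy scorer above, half of the four samples are kept. In practice the default of 0.15 corresponds to the 15% budget mentioned in the abstract, but the right fraction is an empirical choice.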

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling