ROSE: A Reward-Oriented Data Selection Framework for LLM Task-Specific Instruction Tuning

Yang Wu; Huayi Zhang; Yizheng Jiao; Lin Ma; Xiaozhong Liu; Jinhong Yu; Dongyu Zhang; Dezhi Yu; Wei Xu

arXiv:2412.00631·cs.LG·September 1, 2025

TL;DR

ROSE is a reward-oriented data selection framework for task-specific instruction tuning of large language models. It selects the training data most relevant to the target task, so that a small subset is enough, and it outperforms existing data selection methods.

Contribution

The paper proposes ROSE, a data selection method that uses a pairwise preference loss as its reward signal in place of the next-token cross-entropy loss, with the aim of making instruction tuning both more data-efficient and more effective.
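As a rough illustration of what a pairwise preference loss looks like, the sketch below computes the standard DPO objective for a single (chosen, rejected) response pair. It assumes the loss is the DPO objective, as the reviewers describe; the function names, the argument names, and the default `beta` are hypothetical and are not taken from the paper.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def pairwise_preference_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value, chosen only for illustration
) -> float:
    """DPO-style loss for one preference pair.

    Each argument is the summed log-probability of a full response under
    either the model being tuned (policy) or a frozen reference model.
    """
    # How much more the policy favors each response than the reference does.
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    # The loss shrinks as the chosen response gains more than the rejected one.
    return -log_sigmoid(beta * (chosen_gain - rejected_gain))
```

When the policy and the reference agree exactly, the margin is zero and the loss equals log 2; it falls below that value once the policy prefers the chosen response more strongly than the reference does.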

Findings

1. ROSE achieves performance comparable to training on the full dataset while using only 5% of the training data.
2. It surpasses state-of-the-art data selection methods.
3. It demonstrates robustness across multiple datasets and models.

Abstract

Instruction tuning has underscored the significant potential of large language models (LLMs) in producing more human controllable and effective outputs in various domains. In this work, we focus on the data selection problem for task-specific instruction tuning of LLMs. Prevailing methods primarily rely on the crafted similarity metrics to select training data that aligns with the test data distribution. The goal is to minimize instruction tuning loss on the test data, ultimately improving performance on the target task. However, it has been widely observed that instruction tuning loss (i.e., cross-entropy loss for next token prediction) in LLMs often fails to exhibit a monotonic relationship with actual task performance. This misalignment undermines the effectiveness of current data selection methods for task-specific instruction tuning. To address this issue, we introduce ROSE, a…
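The selection step that the abstract sets up can be pictured with a LESS-style gradient-similarity ranking. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes that each training example has a precomputed gradient feature vector, that `val_grad` is the gradient of the chosen validation loss (for ROSE, the reviewers indicate a pairwise preference loss), and that cosine similarity is used as the score. All function and parameter names are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_top_fraction(
    train_grads: Sequence[Sequence[float]],
    val_grad: Sequence[float],
    fraction: float = 0.05,
) -> List[int]:
    """Return indices of the training examples whose gradients align best
    with the validation gradient, keeping roughly `fraction` of the pool."""
    # A gradient step on a training example lowers the validation loss, to
    # first order, when its gradient points the same way as val_grad.
    scores = [cosine(g, val_grad) for g in train_grads]
    k = max(1, math.ceil(fraction * len(train_grads)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]
```

Normalizing by the gradient norm keeps examples with large raw gradients from dominating the ranking. Swapping the validation gradient from a cross-entropy loss to a preference loss changes only how `val_grad` is computed, not the ranking code.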

Peer Reviews

Decision·ICLR 2025 Conference Withdrawn Submission

Reviewer 01 · Rating 5 · Confidence 4

Strengths

1. The method is simple but effective: it replaces the next-token-prediction gradient in the LESS method with the DPO gradient.
2. The experimental results validate the effectiveness of the proposed method; it is impressive that it surpasses the performance of training on the full dataset.

Weaknesses

1. The authors claim that the validation loss fails to exhibit a monotonic relationship with target task performance, which is counter-intuitive in machine learning theory. It would be better to provide more supporting evidence in the introduction, such as a table of experimental results.
2. The relationship between pairwise preference loss and win rate depicted in Figure 3 is insufficient to substantiate the claim of "a more consistent correlation between reduced validation loss and increased

Reviewer 02 · Rating 5 · Confidence 3

Strengths

1. This study addresses an important issue in instruction tuning for LLMs, focusing on data selection to align model outputs more closely with real-world task performance.
2. It is novel to focus on reward maximization rather than traditional empirical risk minimization to optimize data selection, which may offer a fresh perspective on enhancing model alignment with human preferences.
3. The experiments are conducted extensively with both qualitative and quantitative evaluations across various mo

Weaknesses

1. The study uses only 5% of the training dataset for model tuning. It would be beneficial to explore results with other proportions (e.g., 10%, 20%) to understand the method’s effectiveness at varying scales of data selection.
2. The comparison baselines consist primarily of traditional data selection methods. While ROSE employs the GPT-4-32K-0613 model as a judge, exploring data selection baselines with larger models could further validate ROSE’s effectiveness.
3. The study uses specific sho

Reviewer 03 · Rating 5 · Confidence 5

Strengths

The authors put forward a new criterion for selecting data for large models. By using reward signals instead of traditional loss minimization, they optimize data selection for task-specific instruction fine-tuning. The method uses pairwise preference loss as the reward signal, so that the selected data better improves the model's performance on actual tasks.

Weaknesses

1. Although the ROSE method has achieved remarkable results in data selection, its implementation involves complex gradient calculations and impact estimations, which may lead to high computational costs and implementation complexity, especially when dealing with large-scale datasets and models.
2. The ROSE method relies on a small number of preference validation sets to guide data selection, so the quality of the preference data is crucial to the final selection effect. If the preference data i

Reviewer 04 · Rating 5 · Confidence 3

Strengths

1. This paper attempts to address an important question and proposes an effective method that achieves better performance than the compared methods.
2. The motivation of this paper is clear and the proposed method is sound. The technical approach is sound and well-justified, with a clear connection to the theoretical underpinnings of Direct Preference Optimization (DPO) and influence functions.
3. The paper is well-organized and clearly written. The introduction provides a good motivation for th

Weaknesses

1. Lack of comparison with up-to-date task-specific methods [1, 2].
2. Evaluation benchmarks: the method claims to be task-specific, yet the evaluation datasets used are general open-source preference benchmarks. Is there a need for further evaluation on specific tasks, for example summarization?

[1] One Shot Learning as Instruction Data Prospector for Large Language Models
[2] RECOST: External Knowledge Guided Data-efficient Instruction Tuning

Reviewer 05 · Rating 5 · Confidence 4

Strengths

1. Building on LESS, this paper conducts valuable exploration into differentiable metrics beyond cross-entropy loss for data selection procedures. It identifies reward value as a potentially more beneficial objective for preference tasks.
2. ROSE's gradient norm cleverly addresses the issue in LESS where sequence length affected the influence function.

Weaknesses

1. ROSE's effectiveness has only been validated on the Preference Benchmark. However, to my knowledge, LESS has shown excellent performance across various task formats such as MMLU, TyDiQA, and BBH. I suspect this limitation is due to the nature of the pairwise preference loss, which may restrict ROSE's ability to extend to other tasks.
2. Given that ROSE introduces pairwise preference loss calculations in the data selection process, I'm unsure whether this increases the method's computational

Taxonomy

Topics: Intelligent Tutoring Systems and Adaptive Learning · Natural Language Processing Techniques

Methods: Sparse Evolutionary Training · Focus