TL;DR
Select2Reason is a data selection framework that efficiently identifies high-utility long-CoT reasoning instructions, enabling large language models to achieve competitive performance with significantly less training data.
Contribution
The paper introduces Select2Reason, a novel method for automatic selection of high-quality long-CoT reasoning instructions, reducing training overhead while maintaining or improving performance.
Findings
Fine-tuning on 10% of selected data matches full-data performance.
Select2Reason outperforms baseline methods on multiple benchmarks.
The approach is scalable and adaptable to different instruction pools.
Abstract
A practical approach to activate long chain-of-thoughts reasoning ability in pre-trained large language models is to perform supervised fine-tuning on instruction datasets synthesized by strong Large Reasoning Models such as DeepSeek-R1, offering a cost-effective alternative to reinforcement learning. However, large-scale instruction sets with more than 100k samples incur significant training overhead, while effective strategies for automatic long-CoT instruction selection still remain unexplored. In this work, we propose Select2Reason, a novel and efficient instruction-tuning data selection framework for long-CoT reasoning. From the perspective of emergence of rethinking behaviors like self-correction and backtracking, we investigate common metrics that may determine the quality of long-CoT reasoning instructions. Select2Reason leverages a quantifier to estimate difficulty of question…
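The joint ranking idea can be illustrated with a minimal sketch. The reviews below describe the combination of difficulty and trace length as a simple weighted sum or weighted ranking; the exact formula is not given here, so the rank normalization, the weight `alpha`, the word-count measure of trace length, and all names (`Instruction`, `rank_scores`, `select_top`) are assumptions made for illustration, not the authors' implementation.

```python
from dataclasses import dataclass


@dataclass
class Instruction:
    question: str
    trace: str         # long-CoT response text
    difficulty: float  # hypothetical judge score, higher = harder


def rank_scores(values):
    """Map each value to a rank-based score in [0, 1] (1 = largest).

    Ties are broken by input order because sorted() is stable.
    """
    n = len(values)
    if n == 1:
        return [1.0]
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0.0] * n
    for pos, idx in enumerate(order):
        scores[idx] = pos / (n - 1)
    return scores


def select_top(pool, fraction=0.10, alpha=0.5):
    """Keep the top `fraction` of the pool by a weighted sum of ranks.

    alpha weights difficulty; (1 - alpha) weights trace length.
    Both the weighting scheme and its default are assumptions.
    """
    if not pool:
        return []
    diff = rank_scores([x.difficulty for x in pool])
    # Trace length is approximated by whitespace-separated word count.
    length = rank_scores([len(x.trace.split()) for x in pool])
    joint = [alpha * d + (1 - alpha) * l for d, l in zip(diff, length)]
    k = max(1, round(fraction * len(pool)))
    ranked = sorted(range(len(pool)), key=lambda i: joint[i], reverse=True)
    return [pool[i] for i in ranked[:k]]
```

Ranks are used instead of raw values so that the two metrics, which live on different scales, contribute comparably to the sum. Setting `alpha=0.0` reduces the sketch to the length-only heuristic that the reviews cite as a simple baseline.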
Peer Reviews
Decision·ICLR 2026 Conference Withdrawn Submission
1. The method relies on two intuitive and low-cost heuristics (trace length and difficulty). The finding that trace length, in particular, serves as a proxy for data quality is a useful, simple baseline. 2. The central finding—that a 10% subset can outperform the 100% full dataset—is compelling and demonstrates a path toward more efficient tuning for complex reasoning tasks.
1. Validation is limited. The primary claim that 10% selected data surpasses 100% full-pool data is demonstrated on only one model and dataset (Qwen2.5-Math-7B on OpenR1-Math). While generalization is explored, this core efficiency-performance tradeoff is not shown to hold for other model families (e.g., Llama, Mistral). The paper would be stronger if this main result (Table 1) were replicated on at least one other distinct model architecture. 2. Key baselines are not included. The paper discus
Methodological Rigor: The framework is simple yet effective, leveraging quantifiable metrics (difficulty scores, trace length) without complex computations. Comprehensive Experiments: Evaluations span 9 mathematical benchmarks, multiple data scales (2%–10%), and diverse models (Qwen, LLaMA), ensuring reliability. Reproducibility: Experiments are based on public datasets.
Metric Interplay Under-explored: The combination of difficulty and trace length relies on a simple weighted sum. A deeper analysis of their correlation or conflict scenarios (e.g., long-easy vs. short-hard traces) would strengthen the method's rationale. Potential Bias in Difficulty Scoring: The "LLM-as-a-Judge" approach may inherit biases from the specific judge model (Qwen2.5-Math-7B). Using a judge model to filter data might systematically favor a certain type of "difficult problem," while ov
This work tackles data efficiency for long-CoT reasoning, which is a key bottleneck in current LLM research. Combining reasoning trace length and difficulty in a joint ranker is intuitive.
1. The paper motivates the heuristics empirically but provides no deeper theoretical analysis of why the combination works, so the method lacks theoretical grounding. 2. Both “trace length” and “difficulty” metrics are individually known heuristics; their combination, while practical, may be viewed as incremental. 3. Using LLM-as-a-Judge to estimate difficulty may introduce selection bias; this aspect is not systematically analyzed.
The presented method yields stronger performance than the simpler heuristics tested, which might have practical applications for training on CoT data with a limited compute budget. The paper is well written, and the method is clearly presented and easy to understand.
The main issue of the paper might be the limited novelty. As mentioned by the authors (Introduction & Preliminary Exploration sections), data selection based on length and difficulty was already explored by multiple authors, and this work combines both metrics by weighted ranking. The method is complex, and provides only marginal gains over a much simpler heuristic based on generation length. Moreover, the LLM-as-a-Judge used for assessing the difficulty of the problem relies heavily on the capability
