Truth in the Few: High-Value Data Selection for Efficient Multi-Modal Reasoning
Shenshen Li, Xing Xu, Kaiyuan Deng, Lei Wang, Heng Tao Shen, Fumin Shen

TL;DR
This paper introduces a data selection method called RAP that identifies high-value samples to train multi-modal large language models efficiently, achieving comparable or better reasoning performance with significantly less data and computational cost.
Contribution
The paper proposes a novel data selection paradigm for multi-modal reasoning models that focuses on cognitive samples, reducing data requirements and computational costs while maintaining high performance.
Findings
RAP achieves superior performance to full-data training while using only 9.3% of the training data.
RAP reduces computational costs by over 43%.
RAP is effective across six datasets.
Abstract
While multi-modal large language models (MLLMs) have made significant progress in complex reasoning tasks via reinforcement learning, it is commonly believed that extensive training data is necessary for improving multi-modal reasoning ability, inevitably leading to data redundancy and substantial computational costs. However, can smaller high-value datasets match or outperform full corpora for multi-modal reasoning in MLLMs? In this work, we challenge this assumption through a key observation: meaningful multi-modal reasoning is triggered by only a sparse subset of training samples, termed cognitive samples, whereas the majority contribute marginally. Building on this insight, we propose a novel data selection paradigm termed Reasoning Activation Potential (RAP), which identifies cognitive samples by estimating each sample's potential to stimulate genuine multi-modal reasoning by two…
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
The paper addresses a practically important question for multimodal RL: how to identify and exploit the relatively small subset of examples that genuinely drive multimodal reasoning, rather than spending compute on language-prior–biased or trivial samples. The RAP framework is conceptually appealing in that it decomposes “reasoning activation” into three complementary dimensions: causal discrepancy (CDE, grounded in the potential outcome model), attention confidence (ACE, using final-layer self-
1.Static, base-model-conditioned selection. RAP performs a one-shot selection of “cognitive samples” using the pre-RL base model and then keeps this subset fixed for all subsequent RL updates. This inevitably ties the selected data distribution to the initial model’s inductive biases and failure modes: if the base model systematically under-utilizes certain reasoning patterns, those samples may be underrepresented or discarded. The paper does provide evidence that RAP generalizes across backbon
The paper presents an innovative approach to data efficiency by identifying cognitive samples, effectively challenging the belief that large-scale datasets are necessary for strong multi-modal reasoning. The proposed CDE and ACE modules are theoretically grounded and interpretable, offering clear insights into how data contributes to reasoning improvement. Experiments show strong empirical results, achieving comparable or better performance with only 9.3% of data and significantly reducing compu
- The innovation of the Output-level Discrepancy Calculation is limited. Using causal inference to evaluate whether multi-modal samples contain language priors has been explored in prior work, such as [1] Counterfactual VQA: A Cause-Effect Look at Language Bias and [2] Counterfactual Reasoning for Out-of-distribution Multimodal Sentiment Analysis.
- The paper lacks clear definitions for causal inference concepts, which makes the logical flow less coherent. For example, what is the formal defin
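The general idea the reviewer refers to, comparing a model's output with and without the visual input to detect language priors, can be sketched as follows. This is an illustration only, not the paper's CDE definition (which this page does not give): the function name `kl_divergence`, the choice of KL divergence as the discrepancy, and the answer distributions are all hypothetical.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions (assumed discrepancy measure)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical answer distributions over four options for one sample.
with_image = [0.70, 0.10, 0.10, 0.10]

# Case A: the text-only prediction is almost the same, so the answer is
# largely explained by the language prior and the image adds little.
text_only_prior_driven = [0.68, 0.12, 0.10, 0.10]

# Case B: without the image the model is uninformed, so the image matters.
text_only_uninformed = [0.25, 0.25, 0.25, 0.25]

low = kl_divergence(with_image, text_only_prior_driven)
high = kl_divergence(with_image, text_only_uninformed)

print(f"prior-driven sample discrepancy:   {low:.4f}")
print(f"image-dependent sample discrepancy: {high:.4f}")
```

Under these assumptions, a sample with a small discrepancy would be treated as language-prior-biased, while a large discrepancy suggests the visual input actually changes the prediction.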
1. Interesting Problem: The approach addresses a compelling and relevant challenge.
2. Intuitive Approach: The methodology presents a clear, logical progression that is easy to follow and understand.
3. Well-Presented Work: The presentation of the work is thorough and well-organized.
1. "However, existing data selection methods rely on unimodal textual quality, such as human-annotated difficulty estimation" - I think human annotation also has multimodal data. Why is there an emphasis on data selection methods relying on unimodal data quality?
2. The discussion of data selection for reasoning in the related works section lacks sufficient details on prior methods, and it repeatedly highlights the need for data selection without thoroughly addressing existing approaches. It wo
* Moves beyond generic “hard example mining” to explicitly target multi-modal activation (CDE) and weed out spurious process signals (ACE).
* Practical efficiency: about 7–10% of data surpasses full-data RL, very attractive for teams constrained by GPU budgets.
* Improvements hold across multiple datasets, two model sizes, and more than one RL algorithm.
* CDE/ACE contribute complementary gains; DRM prevents the curated set from collapsing into “too easy,” addressing an important ceiling effect.
* ACE uses a length-multiplicative attention score without proper normalization. The attention confidence is effectively a product over a token run, e.g., $ \psi_j(A)=\prod_{i=j}^{L} (\sigma \cdot A_{i,j}) $. This formulation couples score magnitude to reasoning-chain length and can explode/vanish with longer sequences; it also mixes scale from $\sigma$ into the multiplicative path, making thresholding brittle. A log-domain or normalized aggregation (e.g., mean/log-sum-exp over a fixed window) w
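The length coupling described in this review point can be illustrated numerically. The sketch below compares the multiplicative form $\prod_{i=j}^{L} (\sigma \cdot A_{i,j})$ quoted by the reviewer with one possible log-domain, length-normalized alternative (the mean of log attention). The function names, the default `sigma=1.0`, and the attention values are assumptions chosen for illustration; they are not taken from the paper.

```python
import math

def product_score(attn, sigma=1.0):
    """Multiplicative aggregate: product of (sigma * a) over the token run."""
    score = 1.0
    for a in attn:
        score *= sigma * a
    return score

def log_mean_score(attn, eps=1e-12):
    """Length-normalized alternative: mean of log attention values."""
    return sum(math.log(max(a, eps)) for a in attn) / len(attn)

# Two hypothetical runs with identical per-token attention but different lengths.
short_run = [0.5] * 4
long_run = [0.5] * 40

# The product shrinks geometrically as the run gets longer...
print(product_score(short_run), product_score(long_run))

# ...while the mean of logs depends only on the per-token values.
print(log_mean_score(short_run), log_mean_score(long_run))
```

With the product form, a single threshold cannot treat short and long reasoning chains consistently, because the score scales geometrically with run length; the normalized form removes that dependence, at the cost of no longer rewarding or penalizing length at all.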
1. The paper provides strong validation and in-depth analysis of the "Less is More" principle within the domain of Multi-modal Large Language Models. Through empirical analysis, the paper finds that training with only 20% of the data leads to merely a 0.8% performance degradation compared to the full dataset, leading to the proposal of the "truth in the few" phenomenon. Experimental results demonstrate that the method achieves superior performance compared to models trained on the full corpus, w
1. Risk of Insufficient Novelty
- The paper's core finding, "less is more," is not an entirely new concept. Previous works in the context of large language models (LLMs), such as s1 and LIMO, have already proposed the idea of using data selection to enhance reasoning performance and reduce training costs.
- Although the concept is extended to the multi-modal domain in this work, the proposed Causal Discrepancy Estimator (CDE) is fundamentally a measure of output prediction difference between
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Speech and dialogue systems
Methods: Softmax · Attention Is All You Need
