Perception-R1: Advancing Multimodal Reasoning Capabilities of MLLMs via Visual Perception Reward
Tong Xiao, Xin Xu, Zhenya Huang, Hongyu Gao, Quan Liu, Qi Liu, Enhong Chen

TL;DR
Perception-R1 introduces a novel visual perception reward to explicitly enhance the perception and reasoning abilities of Multimodal Large Language Models, achieving state-of-the-art results with limited training data.
Contribution
The paper proposes Perception-R1, a new reinforcement learning approach that explicitly incentivizes accurate visual perception in MLLMs, addressing a key limitation of previous methods.
Findings
Achieves state-of-the-art performance on multiple benchmarks.
Effectively improves multimodal perception and reasoning capabilities.
Requires only 1,442 training samples.
Abstract
Enhancing the multimodal reasoning capabilities of Multimodal Large Language Models (MLLMs) is a challenging task that has attracted increasing attention in the community. Recently, several studies have applied Reinforcement Learning with Verifiable Rewards (RLVR) to the multimodal domain in order to enhance the reasoning abilities of MLLMs. However, these works largely overlook the enhancement of multimodal perception capabilities in MLLMs, which serve as a core prerequisite and foundational component of complex multimodal reasoning. Through McNemar's test, we find that existing RLVR methods fail to effectively enhance the multimodal perception capabilities of MLLMs, thereby limiting their further improvement in multimodal reasoning. To address this limitation, we propose Perception-R1, which introduces a novel visual perception reward that explicitly encourages MLLMs to perceive the…
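The abstract relies on McNemar's test, which compares two paired sets of binary outcomes on the same samples and looks only at the discordant pairs. The sketch below shows the exact (binomial) form of the test. The function name `mcnemar_exact` and the choice of the exact variant are assumptions made for illustration; the quantities the authors actually pair, and the variant they use, are not stated in this summary.

```python
from math import comb


def mcnemar_exact(before, after):
    """Exact two-sided McNemar test on paired binary outcomes.

    `before` and `after` are equal-length sequences of booleans, for example
    whether each sample was judged correct under two different models.
    Returns (b, c, p_value), where b counts pairs that are True then False
    and c counts pairs that are False then True.
    """
    if len(before) != len(after):
        raise ValueError("paired outcome lists must have the same length")

    # Only discordant pairs carry information about a difference.
    b = sum(1 for x, y in zip(before, after) if x and not y)
    c = sum(1 for x, y in zip(before, after) if not x and y)
    n = b + c
    if n == 0:
        return b, c, 1.0  # no discordant pairs: no evidence of a difference

    # Under the null hypothesis, each discordant pair falls in either
    # direction with probability 1/2, so min(b, c) ~ Binomial(n, 0.5).
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)
```

A small p-value indicates that the two sets of outcomes differ beyond what chance would explain; a large one indicates no detectable change, which is the kind of outcome the abstract reports for perception under accuracy-only RLVR.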
Peer Reviews
Decision: ICLR 2026 Poster
1. This paper reveals the impact of poor perception on reasoning performance. Current RLVR methods fail to enhance multimodal perception, which fundamentally limits the reasoning performance of MLLMs.
2. The introduced Perception-R1 framework incorporates a novel visual perception reward that significantly strengthens the visual understanding and reasoning capabilities of MLLMs, particularly in mathematical reasoning tasks.
3. Extensive experiments across multiple benchmarks verify that Percepti

1. The paper claims that it enhances the multimodal reasoning capabilities of MLLMs through improved perception. However, the presented results do not provide direct evidence that the observed performance gains stem specifically from enhanced perception. I suggest including an analysis or ablation that directly links perception improvement to the reasoning gains.
2. While the paper reports significant improvements on multimodal math benchmarks, these results primarily reflect reasoning performan

- The idea of augmenting RL with a verifiable visual perception signal represents a clear conceptual advance over prior RLVR frameworks (e.g., Vision-R1, MM-Eureka) that focus solely on final answer correctness.
- The authors conduct extensive evaluations on multiple multimodal benchmarks, demonstrating the method's effectiveness and robustness.
- The paper is well-structured and clearly written.

- The paper lacks systematic exploration of critical parameters such as the perception reward weight (γ) and judgment thresholds, leaving robustness questions unanswered.
- Although data-efficient, the additional judging and reward assignment stages may increase computational overhead, which is not quantitatively discussed.
- The paper would benefit from more qualitative evidence demonstrating how the model's perception improves, e.g., visual attention maps, step-by-step perception-reasoning exam

1. The paper provides a clear and compelling statistical analysis (using McNemar's test) of accuracy-only RLVR-trained MLLMs. This builds a strong case that a significant bottleneck for current models is indeed multimodal perception, not just high-level reasoning.
2. The proposed visual perception reward is intuitive and cleverly designed. By having an LLM judge responses against verifiable, extracted annotations rather than training a holistic reward model, the method directly targets the iden

1. Limited analysis of generalization: The model's strong generalization from geometry-only training (Geometry3K) to general-domain benchmarks (like MMMU and MMStar) is a key result, but it is not fully explained. The authors hypothesize that they are improving a foundational perception capability, but the link between 'perceiving geometry diagrams' and 'perceiving real-world images' could be strengthened. To make this claim more concrete, the authors could include:
   - Qualitative analysis on ge
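The reviews above refer to a perception reward weight (γ) and to an LLM judge that checks responses against extracted annotations. The sketch below shows one way such a reward might be combined with answer accuracy. Everything specific in it is an assumption made for illustration and is not confirmed by this summary: the additive form, the use of the fraction of satisfied annotations as the perception score, the default `gamma=0.5`, and the function names. The judge itself is not implemented; its per-annotation verdicts are passed in as booleans.

```python
def perception_reward(judgments):
    """Fraction of annotated visual facts the judge marked as correctly perceived.

    `judgments` is a list of booleans, one per extracted annotation
    (hypothetical judge output). An empty list yields 0.0.
    """
    if not judgments:
        return 0.0
    return sum(1 for ok in judgments if ok) / len(judgments)


def total_reward(answer_correct, judgments, gamma=0.5):
    """Accuracy reward plus a gamma-weighted perception reward (assumed form)."""
    accuracy = 1.0 if answer_correct else 0.0
    return accuracy + gamma * perception_reward(judgments)


# Example: correct final answer, judge confirms 1 of 2 annotated facts.
reward = total_reward(True, [True, False], gamma=0.5)
```

Under this assumed form, a larger γ shifts the training signal toward perceiving the image correctly, which is why the reviewers ask how sensitive the results are to that weight.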
Taxonomy
Topics: Natural Language Processing Techniques · Speech and dialogue systems · Intelligent Tutoring Systems and Adaptive Learning
