Bridging Perception and Reasoning: Token Reweighting for RLVR in Multimodal LLMs
Jinda Lu, Junkang Wu, Jinghan Li, Kexin Huang, Shuo Yang, Guoyin Wang, Jiancan Wu, Xiang Wang, Xiangnan He

TL;DR
This paper introduces a token reweighting strategy for multimodal large language models that improves visual grounding and reasoning by explicitly modeling the interdependence of perception and reasoning tokens during reinforcement learning.
Contribution
It proposes a novel plug-and-play token reweighting method that enhances RLVR training for multimodal LLMs, achieving state-of-the-art results.
Findings
Reweighting critical tokens improves model performance.
Explicit modeling of perception and reasoning tokens enhances visual grounding.
Method achieves state-of-the-art results on multiple benchmarks.
Abstract
Extending Reinforcement Learning with Verifiable Rewards (RLVR) to multimodal large language models (MLLMs) faces a fundamental challenge: their responses inherently interleave perception-related tokens, which ground visual content, with reasoning-related tokens, which construct reasoning chains. These token types instantiate distinct yet interdependent capacities -- visual grounding and symbolic reasoning -- making isolated optimization insufficient. Through token-level empirical analysis, we demonstrate that optimizing either perception- or reasoning-only tokens consistently underperforms full optimization, underscoring their inherent coupling. To address this, we propose a plug-and-play Token-Reweighting (ToR) strategy that explicitly models this interdependence by identifying critical tokens of both types and dynamically reweighting them during RLVR training. Applied on top of…
Peer Reviews
Decision: Submitted to ICLR 2026
- This work presents an interesting token-level analysis that classifies MLLM outputs into perception-related and reasoning-related tokens, empirically demonstrating their influence in RLVR optimization.
- The proposed ToR strategy achieves meaningful performance improvements over baseline GRPO and DAPO methods, establishing new state-of-the-art results on several multimodal reasoning and perception benchmarks in a data-efficient manner.
- My main concern lies in the logical foundation of the motivational study. The paper claims that optimizing only reasoning-tokens or only perception-tokens underperforms and that this proves their "interdependence." This conclusion appears to be a logical leap. The experiments merely show that partially disabling the model (i.e., zeroing out gradients for certain tokens) leads to performance degradation, which is an intuitive outcome. These quantitative results do not rigorously demonstrate why
1. The idea of distinguishing reasoning-related and perception-related tokens is novel and well-motivated, as the varying importance of different tokens for reasoning and perception objectively exists.
2. The formulation for reasoning-related tokens is well-grounded, and the notion itself is consistent with the modeling objective (In contrast, the other concept lacks such alignment, as will be discussed in the weakness section).
3. In experiment, the parameter sensitivity of the proposed module
1. The design of the identified measure for perception-related tokens is not theoretically sound. The visual sensitivity score can be rewritten as $S_{i,t}^b = \left|\log \frac{\pi_{\theta}(o_{i,t}^b \mid \mathbf{o}_{i,<t}^b, I_i^b, q_i^b)}{\pi_{\theta}(o_{i,t}^b \mid \mathbf{o}_{i,<t}^b, \emptyset, q_i^b)}\right|$, which indicates that the measure directly depends on the log-ratio between $\pi_{\theta}(o_{i,t}^b \mid \mathbf{o}_{i,<t}^b, \emptyset, q_i^b)$ and $\pi_{\theta}(o_{i,t}^b \mid \mathbf{
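
The visual sensitivity score quoted in this review is the absolute log-ratio between a token's probability with the image and its probability with the image removed. The sketch below is only an illustration of that formula, not the authors' implementation: the function name `visual_sensitivity` and the example probabilities are hypothetical, and it assumes the per-token log-probabilities from the two forward passes have already been computed.

```python
import math
from typing import List, Sequence


def visual_sensitivity(
    logp_with_image: Sequence[float],
    logp_without_image: Sequence[float],
) -> List[float]:
    """Per-token |log pi(o_t | o_<t, I, q) - log pi(o_t | o_<t, no image, q)|."""
    if len(logp_with_image) != len(logp_without_image):
        raise ValueError("both passes must score the same response tokens")
    return [abs(a - b) for a, b in zip(logp_with_image, logp_without_image)]


# Hypothetical probabilities of three sampled tokens under the two passes.
with_image = [math.log(0.8), math.log(0.5), math.log(0.9)]
without_image = [math.log(0.1), math.log(0.5), math.log(0.6)]

scores = visual_sensitivity(with_image, without_image)
# The first token depends strongly on the image, the second not at all.
```

Because the score is an absolute value, a token whose probability drops when the image is present scores as high as one whose probability rises by the same factor.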
- The paper has a good motivation and targets the important problem of enhancing the model's multimodal reasoning capabilities.
- The proposed ToR approach is novel and simple to implement.
- The numerical experiments were compared with a comprehensive list of baselines, and the results feel convincing.
- The proposed method is not really plug and play (which generally means training-free), in the sense that it is a new RLVR approach that requires compute for model retraining.
- The selection criterion in ToR feels very heuristic and non-principled in design. There are at least four hyperparameters (selection quantile for reasoning/perception tokens and weight for reasoning/perception tokens), which make tuning in practice difficult.
- ToR considers a constant weight for the selected token.
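
The four hyperparameters mentioned in this review (a selection threshold and a constant weight for each token type) can be made concrete with a small sketch. Everything here is an assumption for illustration rather than the paper's method: the names `top_fraction_mask`, `token_weights`, and `reweighted_loss` are hypothetical, the selection threshold is expressed as a top fraction (the complement of a quantile), the per-token scores are taken as given, and a token selected by both criteria simply receives the larger of the two weights.

```python
import math
from typing import List, Sequence


def top_fraction_mask(scores: Sequence[float], fraction: float) -> List[bool]:
    """Mark the ceil(fraction * n) highest-scoring tokens as selected."""
    n = len(scores)
    k = min(n, math.ceil(fraction * n)) if fraction > 0 else 0
    ranked = sorted(range(n), key=lambda t: scores[t], reverse=True)
    selected = set(ranked[:k])
    return [t in selected for t in range(n)]


def token_weights(
    perception_scores: Sequence[float],
    reasoning_scores: Sequence[float],
    perception_fraction: float,
    reasoning_fraction: float,
    perception_weight: float,
    reasoning_weight: float,
) -> List[float]:
    """Constant weight for selected tokens, 1.0 for all others."""
    p_mask = top_fraction_mask(perception_scores, perception_fraction)
    r_mask = top_fraction_mask(reasoning_scores, reasoning_fraction)
    weights = []
    for is_perception, is_reasoning in zip(p_mask, r_mask):
        w = 1.0
        if is_perception:
            w = max(w, perception_weight)
        if is_reasoning:
            # Assumption: overlapping selections take the larger weight.
            w = max(w, reasoning_weight)
        weights.append(w)
    return weights


def reweighted_loss(token_losses: Sequence[float], weights: Sequence[float]) -> float:
    """Mean of the per-token losses after applying the token weights."""
    return sum(w * l for w, l in zip(weights, token_losses)) / len(token_losses)


# Hypothetical scores and losses for a four-token response.
weights = token_weights(
    perception_scores=[2.0, 0.0, 0.4, 0.1],
    reasoning_scores=[0.1, 0.9, 0.2, 0.3],
    perception_fraction=0.25,
    reasoning_fraction=0.25,
    perception_weight=2.0,
    reasoning_weight=1.5,
)
loss = reweighted_loss([1.0, 1.0, 1.0, 1.0], weights)
```

In this toy setup the weight is fixed per selected token, which is the behavior the last point of the review questions; a score-dependent weight would replace the constants in `token_weights`.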
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
