Rethinking Token-Level Policy Optimization for Multimodal Chain-of-Thought
Yunheng Li, Hangyi Kuang, Hengrui Zhang, Jiangxia Cao, Zhaojie Liu, Qibin Hou, Ming-Ming Cheng

TL;DR
This paper introduces PEPO (Perception-Exploration Policy Optimization), a token-level policy optimization method for multimodal chain-of-thought reasoning. It assigns token-level advantages based on perceptual grounding and token entropy, and the authors report that it outperforms existing RL baselines.
Contribution
The paper proposes PEPO, a token-level optimization approach that enhances multimodal reasoning by integrating a perception prior with token entropy. The method is compatible with existing RL frameworks.
Findings
PEPO achieves consistent improvements across multiple multimodal benchmarks.
PEPO maintains stable training dynamics while enhancing reasoning performance.
PEPO outperforms strong RL baselines in geometry, grounding, puzzles, and classification tasks.
Abstract
Multimodal Chain-of-Thought (CoT) reasoning requires large vision-language models to construct reasoning trajectories that interleave perceptual grounding with multi-step inference. However, existing Reinforcement Learning with Verifiable Rewards (RLVR) methods typically optimize reasoning at a coarse granularity, treating CoT tokens uniformly without distinguishing their varying degrees of visual grounding. In this work, we conduct a token-level analysis of multimodal reasoning trajectories and show that successful reasoning is characterized by structured token dynamics reflecting both perceptual grounding and exploratory inference. Building upon this analysis, we propose Perception-Exploration Policy Optimization (PEPO), which derives a perception prior from hidden-state similarity and integrates it with token entropy through a smooth gating mechanism to produce token-level advantages. PEPO…
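The abstract names three ingredients (a perception prior from hidden-state similarity, token entropy, and a smooth gate that combines them into token-level advantages) but does not give formulas here. The sketch below is one hypothetical way to wire these pieces together, not the authors' implementation. The max-cosine-similarity prior, the min-max normalization, the sigmoid gate, the `temperature` parameter, and every function name are assumptions made for illustration.

```python
import math

def token_entropy(logits):
    """Shannon entropy (in nats) of softmax(logits) at one token position."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv + 1e-12)

def perception_prior(token_hidden, visual_hidden):
    """ASSUMPTION: prior = max cosine similarity between a response token's
    hidden state and any visual token's hidden state."""
    return max(cosine(token_hidden, v) for v in visual_hidden)

def min_max(values):
    """Rescale to [0, 1]; a constant input maps to 0.5 everywhere."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.5] * len(values)
    return [(x - lo) / (hi - lo) for x in values]

def token_advantages(seq_advantage, logits_per_token, hidden_per_token,
                     visual_hidden, temperature=0.1):
    """ASSUMPTION: spread one sequence-level advantage over tokens using a
    sigmoid gate that blends the perception prior with token entropy."""
    prior = min_max([perception_prior(h, visual_hidden) for h in hidden_per_token])
    entropy = min_max([token_entropy(l) for l in logits_per_token])
    scores = []
    for p, e in zip(prior, entropy):
        # Gate near 1 for strongly grounded tokens, near 0 otherwise.
        gate = 1.0 / (1.0 + math.exp(-(p - 0.5) / temperature))
        scores.append(gate * p + (1.0 - gate) * e)
    mean = sum(scores) / len(scores)
    # Divide by the mean so the average token advantage equals seq_advantage.
    return [seq_advantage * s / mean for s in scores]
```

Under these assumptions, tokens that are either strongly grounded in the image or generated under high uncertainty receive a larger share of the sequence-level advantage. Because the average is preserved, such a reweighting could be applied on top of a sequence-level advantage estimator without changing its overall scale.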
Taxonomy
Topics: Multimodal Machine Learning Applications · Reinforcement Learning in Robotics · Domain Adaptation and Few-Shot Learning
