Rethinking Token-Level Policy Optimization for Multimodal Chain-of-Thought

Yunheng Li; Hangyi Kuang; Hengrui Zhang; Jiangxia Cao; Zhaojie Liu; Qibin Hou; Ming-Ming Cheng

arXiv:2603.22847 · cs.CV · March 25, 2026

Open Access

TL;DR

This paper introduces PEPO, a token-level policy optimization method for multimodal reasoning that improves reasoning accuracy by focusing on perceptual grounding and inference dynamics, outperforming existing RL methods.

Contribution

The paper proposes PEPO, a token-level optimization approach that enhances multimodal reasoning by integrating a perception prior with token entropy. The method is compatible with existing RL frameworks.

Findings

1. PEPO achieves consistent improvements across multiple multimodal benchmarks.

2. PEPO maintains stable training dynamics while enhancing reasoning performance.

3. PEPO outperforms strong RL baselines in geometry, grounding, puzzles, and classification tasks.

Abstract

Multimodal Chain-of-Thought (CoT) reasoning requires large vision-language models to construct reasoning trajectories that interleave perceptual grounding with multi-step inference. However, existing Reinforcement Learning with Verifiable Rewards (RLVR) methods typically optimize reasoning at a coarse granularity, treating all CoT tokens uniformly without distinguishing their varying degrees of visual grounding. In this work, we conduct a token-level analysis of multimodal reasoning trajectories and show that successful reasoning is characterized by structured token dynamics reflecting both perceptual grounding and exploratory inference. Building upon this analysis, we propose Perception-Exploration Policy Optimization (PEPO), which derives a perception prior from hidden state similarity and integrates it with token entropy through a smooth gating mechanism to produce token-level advantages. PEPO…
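The abstract names three ingredients: a perception prior from hidden state similarity, token entropy, and a smooth gate that turns them into token-level advantages. The exact formulas are not given in this excerpt, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. The choice of mean cosine similarity as the prior, the sigmoid gate over standardized scores, the mean-one normalization, and every function and parameter name (`perception_prior`, `token_advantages`, `temperature`) are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def perception_prior(token_states, visual_states):
    """ASSUMPTION: prior = mean cosine similarity of each response-token
    hidden state to the visual-token hidden states."""
    return [
        sum(cosine(h, v) for v in visual_states) / len(visual_states)
        for h in token_states
    ]

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def standardize(xs):
    """Zero-mean, unit-variance scaling; all zeros if the values are constant."""
    mean = sum(xs) / len(xs)
    std = math.sqrt(sum((x - mean) ** 2 for x in xs) / len(xs))
    return [(x - mean) / std if std > 0 else 0.0 for x in xs]

def token_advantages(seq_advantage, token_states, visual_states,
                     token_probs, temperature=1.0):
    """ASSUMPTION: a sigmoid gate over the sum of standardized prior and
    entropy, rescaled to mean 1, redistributes a sequence-level advantage."""
    prior = standardize(perception_prior(token_states, visual_states))
    entropy = standardize([token_entropy(p) for p in token_probs])
    gates = [
        1.0 / (1.0 + math.exp(-(p + e) / temperature))
        for p, e in zip(prior, entropy)
    ]
    mean_gate = sum(gates) / len(gates)
    return [seq_advantage * g / mean_gate for g in gates]

# Toy inputs (made up for illustration): 2 visual tokens, 3 response tokens.
visual_states = [[1.0, 0.0], [0.8, 0.6]]
token_states = [[1.0, 0.1], [0.0, 1.0], [0.5, 0.5]]
token_probs = [[0.5, 0.5], [0.9, 0.1], [0.7, 0.3]]

advantages = token_advantages(1.0, token_states, visual_states, token_probs)
```

In this toy setup, the first token is both the most visually aligned and the most uncertain, so it receives the largest share of the advantage, while the weights still average to one so the sequence-level signal is preserved. A real system would read hidden states and logits from the vision-language model rather than hand-written lists.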

Taxonomy

Topics: Multimodal Machine Learning Applications · Reinforcement Learning in Robotics · Domain Adaptation and Few-Shot Learning