Reflection Anchors for Propagation-Aware Visual Retention in Long-Chain Multimodal Reasoning
Xuan Gong, Hanbo Huang, Hao Zheng, Yiran Zhang, Wenbin Dai, Weishu Zhao, Shiyu Liang

TL;DR
This paper introduces an information-theoretic analysis of visual retention in long-chain multimodal reasoning and a policy optimization method, RAPO, that uses it to decide where to intervene, improving performance on reasoning benchmarks.
Contribution
The paper derives a lower bound on the downstream visual gain of a one-step intervention and proposes RAPO, a GRPO-based method that selects high-entropy reflection anchors to strengthen visual information propagation.
Findings
RAPO significantly improves performance on reasoning benchmarks.
Reflection anchors are enriched at visually sensitive decision points.
RAPO increases visual-dependence signals along reasoning trajectories.
Abstract
Long chain-of-thought (CoT) reasoning improves large vision-language models, but visual information often fades during generation, limiting long-horizon multimodal reasoning. Existing methods either re-inject vision at inference or train policies for stronger grounding, but where to intervene relies on perception heuristics rather than principled gain analysis, and how local visual influence propagates remains implicit. We study this problem from an information-theoretic standpoint and derive a lower bound on the downstream visual gain of a one-step intervention, which suggests two factors: local branching room (token entropy) and downstream visual propagation potential (suffix divergence from a vision-marginalized reference). Guided by this analysis, we propose reflection-anchor policy optimization (RAPO), a GRPO-based policy optimization method that selects high-entropy reflection…
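The two factors named in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the use of per-token KL divergence summed over the remaining positions as a stand-in for "suffix divergence", and the plain logits array standing in for a vision-marginalized reference are all assumptions made for illustration.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the vocabulary axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def token_entropy(probs, eps=1e-12):
    # Shannon entropy (in nats) of each position's next-token distribution.
    return -(probs * np.log(probs + eps)).sum(axis=-1)

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q) per position.
    return (p * (np.log(p + eps) - np.log(q + eps))).sum(axis=-1)

def rank_candidate_positions(logits_with_image, logits_reference):
    """Hypothetical helper: rank positions by token entropy and report a
    suffix-divergence proxy for each position.

    Both inputs have shape (sequence_length, vocab_size). The reference
    logits are assumed to come from a vision-marginalized model; how that
    reference is obtained is outside this sketch.
    """
    p = softmax(logits_with_image)
    q = softmax(logits_reference)
    entropy = token_entropy(p)
    per_token_kl = kl_divergence(p, q)
    # Assumed proxy: sum of per-token KL over positions strictly after t.
    suffix_div = np.cumsum(per_token_kl[::-1])[::-1] - per_token_kl
    # Highest-entropy positions first.
    order = np.argsort(-entropy)
    return order, entropy, suffix_div

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    with_image = rng.normal(size=(6, 8))
    reference = rng.normal(size=(6, 8))
    order, entropy, suffix_div = rank_candidate_positions(with_image, reference)
    print("positions by entropy:", order)
    print("suffix divergence proxy:", np.round(suffix_div, 3))
```

Ranking by entropy alone mirrors the abstract's mention of high-entropy selection; how the method combines entropy with the propagation term is not stated in the truncated abstract, so the sketch reports the two quantities separately.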
