Walk the Talk: Bridging the Reasoning-Action Gap for Thinking with Images via Multimodal Agentic Policy Optimization

Wenhao Yang; Yu Xia; Jinlong Huang; Shiyin Lu; Qing-Guo Chen; Zhao Xu; Weihua Luo; Kaifu Zhang; Yuchen Zhou; Xiaobo Xia; Yuanyu Wan; Lijun Zhang; Tat-Seng Chua

arXiv:2604.06777·cs.CV·April 9, 2026

TL;DR

This paper introduces MAPO, a new training method for multimodal models that aligns visual actions with textual reasoning, improving accuracy in visual reasoning tasks by reducing reasoning-action discrepancies.

Contribution

MAPO is a training approach that explicitly links visual actions with textual descriptions, strengthening multimodal reasoning and addressing the reasoning-action gap in Multimodal Large Language Models (MLLMs).

Findings

01. MAPO reduces gradient variance and improves training stability.

02. Models trained with MAPO outperform existing methods on visual reasoning benchmarks.

03. Explicit textual descriptions of visual content enhance reasoning accuracy.

Abstract

Recent advancements in Multimodal Large Language Models (MLLMs) have incentivized models to "think with images" by actively invoking visual tools during multi-turn reasoning. The common Reinforcement Learning (RL) practice of relying on outcome-based rewards ignores the fact that textual plausibility often masks executive failure, meaning that models may exhibit intuitive textual reasoning while executing imprecise or irrelevant visual actions within their agentic reasoning trajectories. This reasoning-action discrepancy introduces noise that accumulates throughout the multi-turn reasoning process, severely degrading the model's multimodal reasoning capabilities and potentially leading to training collapse. In this paper, we introduce Multimodal Agentic Policy Optimization (MAPO), bridging the gap between textual reasoning and visual actions generated by models within their Multimodal…
