APO: Enhancing Reasoning Ability of MLLMs via Asymmetric Policy Optimization

Minjie Hong; Zirun Guo; Yan Xia; Zehan Wang; Ziang Zhang; Tao Jin; Zhou Zhao

arXiv:2506.21655 · cs.LG · June 30, 2025

TL;DR

This paper introduces Asymmetric Policy Optimization (APO), a novel training method for multimodal large language models that improves reasoning ability by dynamically balancing exploration and overthinking, leading to better performance and generalization.

Contribution

The paper proposes APO with DADS and STCR techniques to enhance reasoning in MLLMs, addressing overthinking and stability issues during reinforcement learning.

Findings

01. View-R1-3B improves reasoning by 7% over base models.

02. View-R1-3B outperforms larger MLLMs on reasoning benchmarks.

03. View-R1-3B maintains general task performance while enhancing reasoning.

Abstract

Multimodal Large Language Models (MLLMs) are powerful at integrating diverse data, but they often struggle with complex reasoning. While reinforcement learning (RL) can boost reasoning in LLMs, applying it to MLLMs is difficult. Common issues include a drop in performance on general tasks and the generation of overly detailed, "overthinking" reasoning. Our work investigates how the KL penalty and overthinking affect RL training in MLLMs. To address these issues, we propose Asymmetric Policy Optimization (APO), which divides the sampled responses into positive and negative groups. For positive samples, Difficulty-Adaptive Divergence Shaping (DADS) dynamically adjusts the KL divergence weight based on their difficulty. This method prevents policy entropy from dropping sharply, improves training stability, makes better use of samples, and preserves the model's existing knowledge.…
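The grouping and difficulty-dependent KL weighting described in the abstract can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration and is not taken from the paper: the `Sample` type, the 0/1 reward convention, the use of the failure rate as a difficulty proxy, the linear schedule, the direction of the schedule (a smaller KL weight for harder prompts), and the default values `beta_min` and `beta_max`. The abstract states only that responses are split into positive and negative groups and that the KL weight for positive samples depends on difficulty.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Sample:
    """One sampled response for a prompt (hypothetical container)."""
    response: str
    reward: float  # assumed convention: 1.0 if the answer is correct, else 0.0


def split_by_reward(
    samples: List[Sample], threshold: float = 0.5
) -> Tuple[List[Sample], List[Sample]]:
    """Divide sampled responses into a positive and a negative group."""
    positive = [s for s in samples if s.reward > threshold]
    negative = [s for s in samples if s.reward <= threshold]
    return positive, negative


def difficulty(samples: List[Sample], threshold: float = 0.5) -> float:
    """Assumed difficulty proxy: fraction of sampled responses that failed."""
    if not samples:
        raise ValueError("need at least one sample")
    positive, _ = split_by_reward(samples, threshold)
    return 1.0 - len(positive) / len(samples)


def kl_weight(diff: float, beta_min: float = 0.0, beta_max: float = 0.04) -> float:
    """Assumed linear schedule: beta_max for easy prompts, beta_min for hard ones."""
    diff = min(max(diff, 0.0), 1.0)  # clamp difficulty to [0, 1]
    return beta_max - (beta_max - beta_min) * diff


if __name__ == "__main__":
    group = [Sample("a", 1.0), Sample("b", 0.0), Sample("c", 0.0), Sample("d", 0.0)]
    positive, negative = split_by_reward(group)
    d = difficulty(group)
    print(len(positive), len(negative), d, kl_weight(d))
```

In a training loop, a weight like the one returned by `kl_weight` would multiply the KL term only for samples in the positive group; how negative samples are treated is not covered by this sketch.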

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Reinforcement Learning in Robotics