Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization
Weiyun Wang, Zhe Chen, Wenhai Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Jinguo Zhu, Xizhou Zhu, Lewei Lu, Yu Qiao, Jifeng Dai

TL;DR
This paper introduces a preference optimization method that improves the reasoning of multimodal large language models (MLLMs), significantly enhancing their Chain-of-Thought performance and achieving state-of-the-art results on reasoning benchmarks.
Contribution
The paper proposes Mixed Preference Optimization (MPO) and a large-scale multimodal reasoning dataset, improving reasoning capabilities of MLLMs beyond existing fine-tuning methods.
Findings
InternVL2-8B-MPO achieves an accuracy of 67.0 on MathVista.
MPO boosts multimodal Chain-of-Thought performance.
InternVL2-8B-MPO approaches the performance of much larger models.
Abstract
Existing open-source multimodal large language models (MLLMs) generally follow a training process involving pre-training and supervised fine-tuning. However, these models suffer from distribution shifts, which limit their multimodal reasoning, particularly their Chain-of-Thought (CoT) performance. To address this, we introduce a preference optimization (PO) process to enhance the multimodal reasoning capabilities of MLLMs. Specifically, (1) on the data side, we design an automated preference data construction pipeline to create MMPR, a high-quality, large-scale multimodal reasoning preference dataset; and (2) on the model side, we explore integrating PO with MLLMs, developing a simple yet effective method, termed Mixed Preference Optimization (MPO), which boosts multimodal CoT performance. Our approach enhances the multimodal reasoning abilities of both InternVL2-8B and InternVL2-76B.…
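The abstract names MPO but this summary does not give its objective. As a rough illustration of what "mixing" preference optimization with ordinary fine-tuning could look like, here is a minimal sketch that assumes a DPO-style preference term combined with an SFT-style generation term on the chosen response. The function names, the weights `w_pref` and `w_gen`, and the choice of exactly these two terms are assumptions made for illustration, not the authors' stated formulation.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO-style preference loss for one (chosen, rejected) response pair.

    Inputs are summed token log-probabilities of each response under the
    trained policy and under a frozen reference model.
    """
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    return -log_sigmoid(chosen_reward - rejected_reward)


def sft_loss(policy_chosen_logp: float, num_chosen_tokens: int) -> float:
    """Generation loss: mean negative log-likelihood of the chosen response."""
    return -policy_chosen_logp / num_chosen_tokens


def mixed_loss(policy_chosen_logp: float, policy_rejected_logp: float,
               ref_chosen_logp: float, ref_rejected_logp: float,
               num_chosen_tokens: int, beta: float = 0.1,
               w_pref: float = 1.0, w_gen: float = 1.0) -> float:
    """Weighted sum of the two terms (weights are hypothetical defaults)."""
    pref = dpo_loss(policy_chosen_logp, policy_rejected_logp,
                    ref_chosen_logp, ref_rejected_logp, beta)
    gen = sft_loss(policy_chosen_logp, num_chosen_tokens)
    return w_pref * pref + w_gen * gen


if __name__ == "__main__":
    # Policy identical to the reference: the preference term equals log(2).
    print(dpo_loss(-5.0, -7.0, -5.0, -7.0))
    # Policy favors the chosen response more than the reference does.
    print(dpo_loss(-4.0, -9.0, -5.0, -7.0))
```

When the policy matches the reference model, the preference term equals log 2 and falls as the policy widens the margin between chosen and rejected responses. The generation term keeps the likelihood of the chosen response from drifting downward, which is one common motivation for mixing the two.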
Code & Models
- OpenGVLab/InternVL3_5-38B (Hugging Face model; 7.5k downloads, 43 likes)
- OpenGVLab/InternVL3_5-8B (Hugging Face model; 46k downloads, 96 likes)
- OpenGVLab/InternVL3-78B (Hugging Face model; 40k downloads, 233 likes)
- OpenGVLab/InternVL3_5-241B-A28B (Hugging Face model; 430 downloads, 136 likes)
- OpenGVLab/InternVL3_5-30B-A3B (Hugging Face model; 109k downloads, 42 likes)
- OpenGVLab/InternVL3_5-38B-Instruct (Hugging Face model; 1.2k downloads, 6 likes)
- OpenGVLab/InternVL2-8B-MPO (Hugging Face model; 71 downloads, 37 likes)
- OpenGVLab/InternVL2_5-78B-MPO (Hugging Face model; 53 downloads, 54 likes)
- OpenGVLab/InternVL2_5-38B-MPO (Hugging Face model; 34 downloads, 20 likes)
- OpenGVLab/InternVL2_5-26B-MPO (Hugging Face model; 1.7k downloads, 14 likes)
Taxonomy
Topics: Natural Language Processing Techniques · Speech and dialogue systems · Topic Modeling
Methods: Parrot optimizer: Algorithm and applications to medical problems
