MM-Verify: Enhancing Multimodal Reasoning with Chain-of-Thought Verification
Linzhuang Sun, Hao Liang, Jingxuan Wei, Bihui Yu, Tianpeng Li, Fan Yang, Zenan Zhou, Wentao Zhang

TL;DR
This paper introduces MM-Verifier and MM-Reasoner, novel multimodal reasoning models enhanced by chain-of-thought verification, achieving state-of-the-art results on multiple benchmarks through data synthesis and fine-tuning techniques.
Contribution
The paper proposes a new two-step data synthesis method for multimodal verification and reasoning, significantly improving model performance and robustness in multimodal reasoning tasks.
Findings
MM-Verifier outperforms larger models on MathCheck, MathVista, and MathVerse.
MM-Reasoner shows strong scalability and effectiveness with increased data.
Used together, MM-Reasoner and MM-Verifier surpass GPT-4o in accuracy on MathVista.
Abstract
Under test-time scaling, integrating External Slow-Thinking with a Verify mechanism has been shown to enhance multi-round reasoning in large language models (LLMs). However, the multimodal (MM) domain still lacks a strong MM-Verifier. In this paper, we introduce MM-Verifier and MM-Reasoner to enhance multimodal reasoning through longer inference and more robust verification. First, we propose a two-step MM verification data synthesis method, which combines a simulation-based tree search with verification and uses rejection sampling to generate high-quality Chain-of-Thought (CoT) data. This data is then used to fine-tune the verification model, MM-Verifier. Additionally, we present a more efficient method for synthesizing MM CoT data, bridging the gap between text-based and multimodal reasoning. The synthesized data is used to fine-tune…
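The rejection-sampling step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: it assumes each candidate solution already carries a ground-truth correctness label, and it keeps a sampled verification rationale only when its verdict agrees with that label. The names `Candidate`, `VerificationSample`, `rejection_sample`, and `toy_propose` are hypothetical, and `toy_propose` is a deterministic stand-in for a model call.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Candidate:
    question: str
    solution: str
    is_correct: bool  # assumed ground-truth label (e.g. answer matched a reference)


@dataclass
class VerificationSample:
    question: str
    solution: str
    rationale: str
    verdict: bool


def rejection_sample(
    candidates: List[Candidate],
    propose: Callable[[Candidate, int], Tuple[str, bool]],
    num_samples: int = 4,
) -> List[VerificationSample]:
    """Keep the first sampled rationale whose verdict matches the label."""
    kept: List[VerificationSample] = []
    for cand in candidates:
        for attempt in range(num_samples):
            rationale, verdict = propose(cand, attempt)
            if verdict == cand.is_correct:  # accept only label-consistent verdicts
                kept.append(
                    VerificationSample(cand.question, cand.solution, rationale, verdict)
                )
                break  # one accepted rationale per candidate
    return kept


def toy_propose(cand: Candidate, attempt: int) -> Tuple[str, bool]:
    # Hypothetical stand-in for a verifier model: wrong on attempt 0, right afterwards.
    verdict = cand.is_correct if attempt >= 1 else not cand.is_correct
    return f"attempt {attempt}: checked each step", verdict


data = [Candidate("q1", "s1", True), Candidate("q2", "s2", False)]
accepted = rejection_sample(data, toy_propose, num_samples=2)
```

In this toy run every candidate is rejected on the first attempt and accepted on the second. A real setup would replace `toy_propose` with sampled model outputs and would likely add filtering on rationale quality as well.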
Taxonomy
Topics: Semantic Web and Ontologies · Natural Language Processing Techniques · Topic Modeling
