Guided Verifier: Collaborative Multimodal Reasoning via Dynamic Process Supervision
Lingzhuang Sun, Ruitong Liu, Yuxia Zhu, Xiaohan Xu, Jingxuan Wei, Xiangxiang Zhang, Bihui Yu, Wentao Zhang

TL;DR
This paper introduces the Guided Verifier framework, which improves reasoning in multimodal large language models by having a verifier check and steer the policy model's reasoning in real time during rollout, limiting error propagation.
Contribution
It proposes a novel dynamic verifier that actively co-solves tasks with the policy model, enabling real-time error detection and correction during reasoning processes.
Findings
Achieves improved reasoning accuracy on MathVista, MathVerse, and MMMU datasets.
Enables an 8B-parameter model to perform competitively with larger models.
Develops the CoRe dataset, which includes process-level negatives, for training the guided verifier.
Abstract
Reinforcement Learning (RL) has emerged as a pivotal mechanism for enhancing the complex reasoning capabilities of Multimodal Large Language Models (MLLMs). However, prevailing paradigms typically rely on solitary rollout strategies where the model works alone. This lack of intermediate oversight renders the reasoning process susceptible to error propagation, where early logical deviations cascade into irreversible failures, resulting in noisy optimization signals. In this paper, we propose the Guided Verifier framework to address these structural limitations. Moving beyond passive terminal rewards, we introduce a dynamic verifier that actively co-solves tasks alongside the policy. During the rollout phase, this verifier interacts with the policy model in real-time, detecting inconsistencies and providing directional signals to steer the model toward valid trajectories. To…
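The rollout-time interaction described in the abstract can be pictured with a minimal Python sketch. It is not the authors' implementation: `guided_rollout`, `policy_step`, `verifier_check`, the retry budget, and the `ANSWER:` termination marker are hypothetical names and choices introduced here for illustration, and the toy policy and verifier stand in for real models.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

# Hypothetical interfaces (assumptions, not from the paper):
# policy_step(question, accepted_steps, hint) -> next reasoning step
# verifier_check(question, accepted_steps, step) -> (is_valid, hint_or_None)
PolicyStep = Callable[[str, List[str], Optional[str]], str]
VerifierCheck = Callable[[str, List[str], str], Tuple[bool, Optional[str]]]


@dataclass
class Rollout:
    steps: List[str] = field(default_factory=list)
    corrections: int = 0  # number of steps the verifier rejected


def guided_rollout(
    question: str,
    policy_step: PolicyStep,
    verifier_check: VerifierCheck,
    max_steps: int = 8,
    max_retries: int = 2,
) -> Rollout:
    """Generate a trajectory step by step, checking each step before accepting it."""
    rollout = Rollout()
    for _ in range(max_steps):
        hint: Optional[str] = None
        for _attempt in range(max_retries + 1):
            step = policy_step(question, rollout.steps, hint)
            ok, hint = verifier_check(question, rollout.steps, step)
            if ok:
                break
            rollout.corrections += 1
        # If the retry budget runs out, the last proposal is kept as-is
        # (an illustrative choice, not something the source specifies).
        rollout.steps.append(step)
        if step.startswith("ANSWER:"):
            break
    return rollout


def toy_policy(question: str, steps: List[str], hint: Optional[str]) -> str:
    # Makes an operator-precedence mistake first, then fixes it when given a hint.
    if not steps:
        return "3 * 4 = 12" if hint else "2 + 3 = 5"
    if len(steps) == 1:
        return "2 + 12 = 14"
    return "ANSWER: 14"


def toy_verifier(question: str, steps: List[str], step: str) -> Tuple[bool, Optional[str]]:
    # Flags the early deviation and returns a directional hint.
    if step == "2 + 3 = 5":
        return False, "Apply multiplication before addition."
    return True, None


result = guided_rollout("2 + 3 * 4", toy_policy, toy_verifier)
print(result.steps, result.corrections)
```

In this toy run the verifier rejects the first proposed step, so the early mistake is corrected before it can propagate into later steps. How the actual framework represents steps, hints, and rewards is not specified in the text above.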
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques
