Guided Verifier: Collaborative Multimodal Reasoning via Dynamic Process Supervision

Lingzhuang Sun; Ruitong Liu; Yuxia Zhu; Xiaohan Xu; Jingxuan Wei; Xiangxiang Zhang; Bihui Yu; Wentao Zhang

arXiv:2602.04290 · cs.CL · February 5, 2026

TL;DR

This paper introduces the Guided Verifier framework, which improves multimodal reasoning in large language models by pairing the policy with a dynamic verifier that checks and steers its reasoning in real time, limiting error propagation during collaborative inference.

Contribution

The paper proposes a dynamic verifier that actively co-solves tasks with the policy model, enabling real-time error detection and correction during reasoning.

Findings

01

Achieves improved reasoning accuracy on MathVista, MathVerse, and MMMU datasets.

02

Enables an 8B-parameter model to perform competitively with larger models.

03

Develops the CoRe dataset for training the guided verifier with process-level negatives.

Abstract

Reinforcement Learning (RL) has emerged as a pivotal mechanism for enhancing the complex reasoning capabilities of Multimodal Large Language Models (MLLMs). However, prevailing paradigms typically rely on solitary rollout strategies where the model works alone. This lack of intermediate oversight renders the reasoning process susceptible to error propagation, where early logical deviations cascade into irreversible failures, resulting in noisy optimization signals. In this paper, we propose the Guided Verifier framework to address these structural limitations. Moving beyond passive terminal rewards, we introduce a dynamic verifier that actively co-solves tasks alongside the policy. During the rollout phase, this verifier interacts with the policy model in real-time, detecting inconsistencies and providing directional signals to steer the model toward valid trajectories. To…
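
The rollout-time interaction described in the abstract can be illustrated with a small toy loop. This is only a sketch under stated assumptions, not the paper's implementation: the names guided_rollout, Verdict, Trajectory, toy_policy, and toy_verifier, the retry limit, and the "ANSWER" stop marker are all hypothetical, and the policy and verifier are plain Python functions standing in for models.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Verdict:
    ok: bool                    # True if the candidate step looks consistent
    hint: Optional[str] = None  # directional signal returned on rejection


@dataclass
class Trajectory:
    steps: List[str] = field(default_factory=list)
    interventions: int = 0      # number of steps the verifier rejected


# Hypothetical interfaces (assumptions, not taken from the paper):
#   policy(question, steps_so_far, hint) -> next reasoning step
#   verifier(question, steps_so_far, candidate_step) -> Verdict
Policy = Callable[[str, List[str], Optional[str]], str]
Verifier = Callable[[str, List[str], str], Verdict]


def guided_rollout(question: str, policy: Policy, verifier: Verifier,
                   max_steps: int = 8, max_retries: int = 2,
                   stop_token: str = "ANSWER") -> Trajectory:
    """Generate a trajectory step by step, checking each step as it is produced."""
    traj = Trajectory()
    for _ in range(max_steps):
        step = policy(question, traj.steps, None)
        for _ in range(max_retries):
            verdict = verifier(question, traj.steps, step)
            if verdict.ok:
                break
            # Rejected: count the intervention and re-sample with the hint.
            traj.interventions += 1
            step = policy(question, traj.steps, verdict.hint)
        # If every check failed, the last re-sampled step is kept unverified.
        traj.steps.append(step)
        if step.startswith(stop_token):
            break
    return traj


def toy_policy(question: str, steps: List[str], hint: Optional[str]) -> str:
    # Deliberately slips on the first step unless it has received a hint.
    if not steps:
        return "2 + 3 = 5" if hint else "2 + 3 = 6"
    return "ANSWER 5"


def toy_verifier(question: str, steps: List[str], step: str) -> Verdict:
    # Checks only steps of the form "a + b = c"; everything else passes.
    if "=" in step:
        lhs, rhs = step.split("=")
        a, b = (int(x) for x in lhs.split("+"))
        if a + b != int(rhs):
            return Verdict(ok=False, hint="recheck the sum")
    return Verdict(ok=True)


traj = guided_rollout("What is 2 + 3?", toy_policy, toy_verifier)
print(traj.steps, traj.interventions)
```

In this toy run the verifier rejects the faulty first step once, the policy corrects it after receiving the hint, and the early error never enters the trajectory. How the actual framework represents verifier feedback and folds it into RL training is not specified in the text above.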

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques