SVSR: A Self-Verification and Self-Rectification Paradigm for Multimodal Reasoning
Zhe Qian, Nianbing Su, Zhonghua Wang, Hebei Li, Zhongxing Xu, Yueying Li, Fei Luo, Zhuohan Ouyang, Yanbiao Ma

TL;DR
This paper introduces SVSR, a framework that builds self-verification and self-rectification into a multimodal model's reasoning pipeline to improve accuracy and robustness on complex visual understanding and multimodal reasoning tasks.
Contribution
The paper proposes a three-stage training paradigm for self-reflective multimodal models, combining dataset refinement, supervised fine-tuning, and semi-online preference optimization.
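This summary does not spell out the objective used in the preference-optimization stage. For orientation only, here is a minimal sketch of a widely used pairwise preference loss (the standard DPO formulation) for a single chosen/rejected pair. Using it here is an assumption, as are the function name, argument names, and the default beta; none of them come from the paper.

```python
import math


def pairwise_preference_loss(policy_chosen_logp, policy_rejected_logp,
                             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO-style loss for one preference pair (illustrative only).

    Inputs are summed log-probabilities of the chosen and rejected
    responses under the trained policy and a frozen reference model.
    All names and the beta default are hypothetical.
    """
    # How much more the policy favors each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    x = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)), computed in a numerically stable way.
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


# Policy identical to the reference: the loss equals log(2).
baseline = pairwise_preference_loss(-5.0, -5.0, -5.0, -5.0)
# Policy that prefers the chosen response more than the reference does.
improved = pairwise_preference_loss(-3.0, -8.0, -5.0, -5.0)
```

The loss falls as the policy raises the chosen response's likelihood relative to the rejected one, measured against the reference model. A semi-online variant would presumably regenerate some preference pairs from the current policy during training, but how SVSR does this is not stated in this summary.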
Findings
SVSR improves reasoning accuracy across multiple benchmarks.
Models trained with SVSR show better generalization to unseen tasks.
Self-reflective training enhances implicit reasoning abilities.
Abstract
Current multimodal models often suffer from shallow reasoning, leading to errors caused by incomplete or inconsistent thought processes. To address this limitation, we propose Self-Verification and Self-Rectification (SVSR), a unified framework that explicitly integrates self-verification and self-rectification into the model's reasoning pipeline, substantially improving robustness and reliability in complex visual understanding and multimodal reasoning tasks. SVSR is built on a novel three-stage training paradigm. First, we construct a high-quality unified preference dataset by refining reasoning traces from pre-trained vision-language models, incorporating both forward and backward reasoning to embed self-reflective signals. Second, we perform cold-start supervised fine-tuning on this dataset to learn structured, multi-step reasoning behaviors. Third, we apply a Semi-online Direct…
