Self-Rewarded Multimodal Coherent Reasoning Across Diverse Visual Domains

Jesen Zhang; Ningyuan Liu; Kaitong Cai; Sidi Liu; Jing Yang; Ziliang Chen; Xiaofei Sun; Keze Wang

arXiv:2512.22545 · cs.CV · December 30, 2025

TL;DR

This paper introduces SR-MCR, a novel framework that enhances multimodal large language models' reasoning coherence and visual grounding by leveraging intrinsic process signals and a self-reward mechanism, leading to improved accuracy.

Contribution

It proposes a lightweight, label-free self-rewarded training method that aligns reasoning steps using intrinsic signals, improving multimodal LLMs without additional supervision.

Findings

01

SR-MCR outperforms existing models on visual reasoning benchmarks.

02

SR-MCR achieves 81.4% accuracy with a 7B-parameter model.

03

Ablation studies validate the effectiveness of each reward component.

Abstract

Multimodal LLMs often produce fluent yet unreliable reasoning, exhibiting weak step-to-step coherence and insufficient visual grounding, largely because existing alignment approaches supervise only the final answer while ignoring the reliability of the intermediate reasoning process. We introduce SR-MCR, a lightweight and label-free framework that aligns reasoning by exploiting intrinsic process signals derived directly from model outputs. Five self-referential cues -- semantic alignment, lexical fidelity, non-redundancy, visual grounding, and step consistency -- are integrated into a normalized, reliability-weighted reward that provides fine-grained process-level guidance. A critic-free GRPO objective, enhanced with a confidence-aware cooling mechanism, further stabilizes training and suppresses trivial or overly confident generations. Built on Qwen2.5-VL, SR-MCR improves both answer…
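The reward described in the abstract can be illustrated with a minimal sketch. It assumes each of the five cues has already been scored in [0, 1] and that the reliability weights are given; the function names, the weight values, and the use of a plain weighted average are hypothetical choices made for illustration, not the paper's actual formulation. The second function shows the standard critic-free, group-relative normalization used by GRPO, where each sampled response is scored against the other responses for the same prompt. The confidence-aware cooling mechanism is left out because the abstract does not define it.

```python
from statistics import mean, pstdev

# The five cue names come from the abstract; how each cue is scored is not shown here.
CUES = (
    "semantic_alignment",
    "lexical_fidelity",
    "non_redundancy",
    "visual_grounding",
    "step_consistency",
)


def combined_reward(scores, weights):
    """Weighted average of per-cue scores (hypothetical combination rule).

    scores:  dict mapping each cue name to a score assumed to lie in [0, 1].
    weights: dict mapping each cue name to a non-negative reliability weight.
    Dividing by the weight total keeps the result in [0, 1].
    """
    total = sum(weights[c] for c in CUES)
    return sum(weights[c] * scores[c] for c in CUES) / total


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of responses to the same prompt.

    This is the critic-free baseline used by GRPO: the group mean replaces
    a learned value function, and the group standard deviation rescales.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical weights and scores for two sampled responses.
    weights = {c: 1.0 for c in CUES}
    weights["visual_grounding"] = 2.0
    response_a = {c: 0.8 for c in CUES}
    response_b = {c: 0.4 for c in CUES}
    rewards = [combined_reward(s, weights) for s in (response_a, response_b)]
    print(rewards, group_relative_advantages(rewards))
```

With this normalization, a group in which every response earns the same reward yields zero advantage for all of them, so such a group contributes no gradient signal.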

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Graph Neural Networks · Topic Modeling