Self-Rewarded Multimodal Coherent Reasoning Across Diverse Visual Domains
Jesen Zhang, Ningyuan Liu, Kaitong Cai, Sidi Liu, Jing Yang, Ziliang Chen, Xiaofei Sun, Keze Wang

TL;DR
This paper introduces SR-MCR, a framework that improves the reasoning coherence and visual grounding of multimodal large language models by turning intrinsic process signals into a self-reward, which in turn improves accuracy.
Contribution
It proposes a lightweight, label-free self-rewarded training method that aligns reasoning steps using intrinsic signals, improving multimodal LLMs without additional supervision.
Findings
SR-MCR outperforms existing models on visual reasoning benchmarks.
SR-MCR achieves 81.4% accuracy with a 7B-parameter model.
Ablation studies validate the effectiveness of each reward component.
Abstract
Multimodal LLMs often produce fluent yet unreliable reasoning, exhibiting weak step-to-step coherence and insufficient visual grounding, largely because existing alignment approaches supervise only the final answer while ignoring the reliability of the intermediate reasoning process. We introduce SR-MCR, a lightweight and label-free framework that aligns reasoning by exploiting intrinsic process signals derived directly from model outputs. Five self-referential cues -- semantic alignment, lexical fidelity, non-redundancy, visual grounding, and step consistency -- are integrated into a normalized, reliability-weighted reward that provides fine-grained process-level guidance. A critic-free GRPO objective, enhanced with a confidence-aware cooling mechanism, further stabilizes training and suppresses trivial or overly confident generations. Built on Qwen2.5-VL, SR-MCR improves both answer…
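The abstract names the ingredients of the reward but this page does not give the formulas, so the following is only a minimal sketch under stated assumptions. It assumes each of the five cues has already been scored in [0, 1], combines them with a weighted average (the weights stand in for the "reliability" weighting and are hypothetical), and then applies the standard critic-free, group-relative advantage used by GRPO: each sampled response is scored against the mean and standard deviation of its own sampling group. The names `CUES`, `combine_cues`, and `group_relative_advantages` are illustrative and not taken from the paper, and the confidence-aware cooling mechanism is not sketched because the abstract does not define it.

```python
from statistics import mean, pstdev

# The five self-referential cues named in the abstract.
CUES = (
    "semantic_alignment",
    "lexical_fidelity",
    "non_redundancy",
    "visual_grounding",
    "step_consistency",
)


def combine_cues(scores, weights):
    """Weighted average of per-cue scores (assumed to lie in [0, 1]).

    ASSUMPTION: the paper's normalization and reliability weighting are
    not given on this page; a plain normalized weighted mean is used here.
    """
    total = sum(weights[c] for c in CUES)
    return sum(weights[c] * scores[c] for c in CUES) / total


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one sampling group.

    No learned critic is involved; the group mean acts as the baseline.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    weights = {c: 1.0 for c in CUES}  # hypothetical equal weights
    group = [
        {c: 0.2 for c in CUES},
        {c: 0.4 for c in CUES},
        {c: 0.6 for c in CUES},
    ]
    rewards = [combine_cues(s, weights) for s in group]
    print(group_relative_advantages(rewards))
```

In this sketch a response whose combined reward sits above its group's mean receives a positive advantage and one below receives a negative advantage; if every response in a group scores the same, all advantages are zero and the group contributes no learning signal.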
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Graph Neural Networks · Topic Modeling
