Coherent Multimodal Reasoning with Iterative Self-Evaluation for Vision-Language Models
Wenjie Luo, Ruocheng Li, Shanshan Zhu, Julian Perry

TL;DR
This paper introduces CMRF, a multimodal reasoning framework that improves vision-language models' complex reasoning through problem decomposition, iterative self-evaluation, and self-correction, and reports state-of-the-art results on multiple benchmarks.
Contribution
The paper presents CMRF, a multimodal reasoning framework whose problem-decomposition and iterative self-evaluation modules improve the reasoning accuracy and coherence of large vision-language models (LVLMs).
Findings
Achieves 69.4% accuracy on VCR, surpassing baselines by 2.4 percentage points.
Outperforms existing open-source LVLMs on A-OKVQA and DailyLife-MRC benchmarks.
Ablation studies confirm the effectiveness of each module and iterative refinement.
Abstract
Despite significant advancements, current large language models (LLMs) and large vision-language models (LVLMs) continue to struggle with complex, multi-step, cross-modal common sense reasoning tasks, often exhibiting a lack of "deliberative thinking." They tend to rely on superficial associations rather than deep, chained inference, particularly when integrating visual information with abstract concepts. To address this, we propose the Coherent Multimodal Reasoning Framework (CMRF), a novel approach that enhances LVLMs' common sense reasoning capabilities through an iterative, self-evaluating inference mechanism. CMRF mimics human problem-solving by decomposing complex queries, generating step-by-step inferences, and self-correcting errors. Our framework integrates three key modules: a Reasoning Decomposition Unit (RDU) for breaking down problems into sub-questions, a Contextual Inference…
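The decompose, infer, and self-correct loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract is truncated before it fully names the second and third modules, so the function `iterative_reasoning`, its parameters, and the toy stand-ins below are all hypothetical. The visual input is omitted; in practice each callable would presumably wrap an LVLM call that also receives the image.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, str]  # (sub-question, answer)


def iterative_reasoning(
    question: str,
    decompose: Callable[[str], List[str]],
    infer: Callable[[str, List[Step], int], str],
    evaluate: Callable[[str, str, List[Step]], bool],
    max_rounds: int = 3,
) -> List[Step]:
    """Answer each sub-question in order, retrying answers the evaluator rejects.

    All three callables are hypothetical placeholders for model-backed modules.
    """
    chain: List[Step] = []
    for sub_q in decompose(question):
        answer = infer(sub_q, chain, 0)
        attempt = 1
        # Retry while attempts remain and the evaluator rejects the current answer.
        while attempt < max_rounds and not evaluate(sub_q, answer, chain):
            answer = infer(sub_q, chain, attempt)
            attempt += 1
        chain.append((sub_q, answer))
    return chain


# Toy stand-ins so the sketch runs without any model (purely illustrative).
def toy_decompose(question: str) -> List[str]:
    return [part.strip() for part in question.split(" and ")]


def toy_infer(sub_q: str, chain: List[Step], attempt: int) -> str:
    return f"draft-{attempt}: {sub_q}"


def toy_evaluate(sub_q: str, answer: str, chain: List[Step]) -> bool:
    # Pretend the first draft is always flawed and the second is acceptable.
    return answer.startswith("draft-1")


if __name__ == "__main__":
    steps = iterative_reasoning(
        "who is in the image and what are they doing",
        toy_decompose,
        toy_infer,
        toy_evaluate,
    )
    for sub_q, answer in steps:
        print(sub_q, "->", answer)
```

The `max_rounds` cap is a design assumption that keeps the loop from running forever when the evaluator never accepts an answer; the source does not state how CMRF bounds its iterations.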
