Coherent Multimodal Reasoning with Iterative Self-Evaluation for Vision-Language Models

Wenjie Luo; Ruocheng Li; Shanshan Zhu; Julian Perry

arXiv:2508.02886·cs.CL·August 6, 2025

TL;DR

This paper introduces CMRF, a multimodal reasoning framework that improves vision-language models' complex reasoning through problem decomposition, iterative self-evaluation, and self-correction, achieving state-of-the-art results among open-source LVLMs on VCR, A-OKVQA, and DailyLife-MRC.

Contribution

The paper presents CMRF, a new multimodal reasoning framework with iterative self-evaluation and problem decomposition modules, enhancing LVLMs' reasoning accuracy and coherence.

Findings

01. Achieves 69.4% accuracy on VCR, surpassing baselines by 2.4 percentage points.

02. Outperforms existing open-source LVLMs on A-OKVQA and DailyLife-MRC benchmarks.

03. Ablation studies confirm the effectiveness of each module and iterative refinement.

Abstract

Despite significant advancements, current large language models (LLMs) and vision-language models (LVLMs) continue to struggle with complex, multi-step, cross-modal common sense reasoning tasks, often exhibiting a lack of "deliberative thinking." They tend to rely on superficial associations rather than deep, chained inference, particularly when integrating visual information with abstract concepts. To address this, we propose the Coherent Multimodal Reasoning Framework (CMRF), a novel approach that enhances LVLMs' common sense reasoning capabilities through an iterative, self-evaluating inference mechanism. CMRF mimics human problem-solving by decomposing complex queries, generating step-by-step inferences, and self-correcting errors. Our framework integrates three key modules: a Reasoning Decomposition Unit (RDU) for breaking down problems into sub-questions, a Contextual Inference…
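
The decompose, infer, and self-correct loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function `iterative_reasoning`, the `Step` record, the `threshold` and `max_rounds` parameters, and the three toy callables are all hypothetical names chosen for illustration, and the image input is omitted for brevity. In a real system the three callables would wrap LVLM calls.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    """One reasoning step: a sub-question, its answer, and a self-assigned score."""
    sub_question: str
    answer: str
    score: float


def iterative_reasoning(
    question: str,
    decompose: Callable[[str], List[str]],
    infer: Callable[[str, List[str]], str],
    evaluate: Callable[[str, str, List[str]], float],
    threshold: float = 0.7,   # assumed acceptance score, not from the paper
    max_rounds: int = 3,      # assumed cap on attempts per sub-question
) -> List[Step]:
    """Answer each sub-question in order, retrying low-scoring answers."""
    steps: List[Step] = []
    for sub_q in decompose(question):
        # Earlier answers serve as context for later sub-questions.
        context = [s.answer for s in steps]
        answer = infer(sub_q, context)
        score = evaluate(sub_q, answer, context)
        rounds = 1
        while score < threshold and rounds < max_rounds:
            # Tell the inference step what was rejected, then try again.
            hint = context + [f"rejected attempt: {answer}"]
            candidate = infer(sub_q, hint)
            cand_score = evaluate(sub_q, candidate, context)
            if cand_score > score:  # keep only improvements
                answer, score = candidate, cand_score
            rounds += 1
        steps.append(Step(sub_q, answer, score))
    return steps


# Toy stand-ins so the sketch runs without a model (purely illustrative).
def toy_decompose(question: str) -> List[str]:
    return [part.strip() for part in question.split(" and ")]


def toy_infer(sub_q: str, context: List[str]) -> str:
    # The first attempt is deliberately weak; a retry with feedback succeeds.
    if any(c.startswith("rejected attempt:") for c in context):
        return f"answer to: {sub_q}"
    return "unsure"


def toy_evaluate(sub_q: str, answer: str, context: List[str]) -> float:
    return 0.2 if answer == "unsure" else 0.9
```

Calling `iterative_reasoning("who is in the scene and why are they running", toy_decompose, toy_infer, toy_evaluate)` yields two steps, each accepted on the second attempt. Capping retries with `max_rounds` keeps the loop bounded even when the evaluator never accepts an answer.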
