CIMR: Contextualized Iterative Multimodal Reasoning for Robust Instruction Following in LVLMs

Yangshu Yuan; Heng Chen; Xinyi Jiang; Christian Ng; Kexin Qiu

arXiv:2507.22074 · cs.LG · July 31, 2025

TL;DR

CIMR introduces a novel iterative reasoning framework for LVLMs that enhances complex multi-modal instruction following through dynamic feedback integration and self-correction, significantly improving accuracy on challenging tasks.

Contribution

This paper presents CIMR, a new framework that enables iterative, context-aware reasoning and self-correction in LVLMs, improving their performance on complex multi-modal tasks.

Findings

01. CIMR achieves 91.5% accuracy on the MAP dataset.

02. CIMR outperforms state-of-the-art models such as GPT-4V and LLaVA-1.5.

03. The results demonstrate the effectiveness of iterative reasoning and self-correction.

Abstract

The rapid advancement of Large Language Models (LLMs) and Large Vision-Language Models (LVLMs) has enhanced our ability to process and generate human language and visual information. However, these models often struggle with complex, multi-step multi-modal instructions that require logical reasoning, dynamic feedback integration, and iterative self-correction. To address this, we propose CIMR: Contextualized Iterative Multimodal Reasoning, a novel framework that introduces a context-aware iterative reasoning and self-correction module. CIMR operates in two stages: initial reasoning and response generation, followed by iterative refinement using parsed multi-modal feedback. A dynamic fusion module deeply integrates textual, visual, and contextual features at each step. We fine-tune LLaVA-1.5-7B on the Visual Instruction Tuning (VIT) dataset and evaluate CIMR on the newly introduced…
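The two-stage procedure described in the abstract (an initial response, then refinement driven by parsed feedback) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `iterative_refine`, `generate`, `get_feedback`, and `max_steps` are hypothetical, the model and the feedback parser are passed in as plain callables, and the dynamic fusion module is not modeled at all.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Optional, Tuple

# Hypothetical types: a context is the list of (previous response, feedback) pairs.
Context = List[Tuple[str, str]]


@dataclass
class RefinementResult:
    response: str                                   # final response after refinement
    context: Context = field(default_factory=list)  # feedback history that was used


def iterative_refine(
    instruction: str,
    image: Any,
    generate: Callable[[str, Any, Context], str],
    get_feedback: Callable[[str, Any], Optional[str]],
    max_steps: int = 3,
) -> RefinementResult:
    """Sketch of a generate-then-refine loop (assumed structure, not the paper's code)."""
    context: Context = []

    # Stage 1: initial reasoning and response generation, with no feedback yet.
    response = generate(instruction, image, context)

    # Stage 2: iterative refinement using parsed feedback.
    for _ in range(max_steps):
        feedback = get_feedback(response, image)
        if feedback is None:
            # No remaining issues were reported, so stop early.
            break
        context.append((response, feedback))
        response = generate(instruction, image, context)

    return RefinementResult(response=response, context=context)


# Toy stand-ins for the model and the feedback parser, for illustration only.
def toy_generate(instruction: str, image: Any, context: Context) -> str:
    return f"answer v{len(context) + 1}"


def toy_feedback(response: str, image: Any) -> Optional[str]:
    return None if response == "answer v3" else f"revise {response}"


result = iterative_refine("count the red blocks", None, toy_generate, toy_feedback, max_steps=5)
print(result.response)
```

In a real system `generate` would wrap an LVLM call and `get_feedback` would parse textual or visual feedback; capping the loop with `max_steps` is a design choice made here to guarantee termination, not something stated in the source.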
