CIMR: Contextualized Iterative Multimodal Reasoning for Robust Instruction Following in LVLMs
Yangshu Yuan, Heng Chen, Xinyi Jiang, Christian Ng, Kexin Qiu

TL;DR
CIMR is an iterative reasoning framework for LVLMs that improves complex multi-modal instruction following through dynamic feedback integration and self-correction, reaching 91.5% accuracy on the MAP dataset.
Contribution
This paper presents CIMR, a new framework that enables iterative, context-aware reasoning and self-correction in LVLMs, improving their performance on complex multi-modal tasks.
Findings
CIMR achieves 91.5% accuracy on the MAP dataset.
Outperforms state-of-the-art models like GPT-4V and LLaVA-1.5.
Demonstrates the effectiveness of iterative reasoning and self-correction.
Abstract
The rapid advancement of Large Language Models (LLMs) and Large Vision-Language Models (LVLMs) has enhanced our ability to process and generate human language and visual information. However, these models often struggle with complex, multi-step multi-modal instructions that require logical reasoning, dynamic feedback integration, and iterative self-correction. To address this, we propose CIMR: Contextualized Iterative Multimodal Reasoning, a novel framework that introduces a context-aware iterative reasoning and self-correction module. CIMR operates in two stages: initial reasoning and response generation, followed by iterative refinement using parsed multi-modal feedback. A dynamic fusion module deeply integrates textual, visual, and contextual features at each step. We fine-tune LLaVA-1.5-7B on the Visual Instruction Tuning (VIT) dataset and evaluate CIMR on the newly introduced…
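The two-stage flow described in the abstract (an initial response, then refinement driven by parsed feedback) can be sketched as a simple control loop. This is only an illustrative sketch and not the authors' implementation: the names `ReasoningState`, `iterative_refine`, `toy_generate`, and `toy_feedback` are hypothetical, the model and feedback parser are replaced by toy stand-in functions, and the dynamic fusion module is not modeled at all.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class ReasoningState:
    """Hypothetical container for one instruction-following episode."""
    instruction: str
    response: str = ""
    history: List[str] = field(default_factory=list)


def iterative_refine(
    state: ReasoningState,
    generate: Callable[[ReasoningState, Optional[str]], str],
    get_feedback: Callable[[ReasoningState], Optional[str]],
    max_steps: int = 3,
) -> ReasoningState:
    """Stage 1: produce an initial response. Stage 2: refine it while feedback exists."""
    # Stage 1: initial reasoning and response generation (no feedback yet).
    state.response = generate(state, None)
    state.history.append(state.response)

    # Stage 2: iterative refinement, bounded by max_steps.
    for _ in range(max_steps):
        feedback = get_feedback(state)
        if feedback is None:  # nothing left to correct, stop early
            break
        state.response = generate(state, feedback)
        state.history.append(state.response)
    return state


# Toy stand-ins (assumptions for illustration only; a real system would call an LVLM).
def toy_generate(state: ReasoningState, feedback: Optional[str]) -> str:
    return "draft" if feedback is None else f"revised: {feedback}"


def toy_feedback(state: ReasoningState) -> Optional[str]:
    # Flag the first draft once, then accept the revision.
    return "check step 2" if state.response == "draft" else None


result = iterative_refine(
    ReasoningState(instruction="Count the red objects"),
    toy_generate,
    toy_feedback,
)
```

In this sketch the loop stops either when the feedback function reports nothing to fix or when `max_steps` is reached; how the paper decides when to stop is not stated in the text above.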
