Let's Think with Images Efficiently! An Interleaved-Modal Chain-of-Thought Reasoning Framework with Dynamic and Precise Visual Thoughts

Xu Liu; Yongheng Zhang; Qiguang Chen; Yao Li; Sheng Wang; Libo Qin

arXiv:2603.21754 · cs.CV · March 24, 2026 · AAAI

Open Access

TL;DR

This paper introduces DaP-ICoT, an interleaved-modal chain-of-thought reasoning framework that integrates visual thoughts dynamically and precisely, improving the efficiency and coherence of multimodal reasoning.

Contribution

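The paper proposes a dynamic and precise method for integrating visual thoughts into ICoT. It targets two limitations of current ICoT methods, static visual thought positioning and broken (discontinuous, semantically incoherent) visual thought representation, and reports state-of-the-art performance with reduced token consumption.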

Findings

01. Achieves state-of-the-art results on multiple benchmarks.

02. Reduces token consumption by 72.6%.

03. Enhances reasoning efficiency and coherence.

Abstract

Recently, Interleaved-modal Chain-of-Thought (ICoT) reasoning has achieved remarkable success by leveraging both multimodal inputs and outputs, attracting increasing attention. While achieving promising performance, current ICoT methods still suffer from two major limitations: (1) Static Visual Thought Positioning, which statically inserts visual information at fixed steps, resulting in inefficient and inflexible reasoning; and (2) Broken Visual Thought Representation, which involves discontinuous and semantically incoherent visual tokens. To address these limitations, we introduce Interleaved-modal Chain-of-Thought reasoning with Dynamic and Precise Visual Thoughts (DaP-ICoT), which incorporates two key components: (1) Dynamic Visual Thought Integration adaptively introduces visual inputs based on reasoning needs, reducing redundancy and improving efficiency. (2) Precise Visual Thought…
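The contrast the abstract draws between static positioning and dynamic integration can be illustrated with a toy sketch. This is not the paper's algorithm: the per-step "need score", the threshold rule, the token cost per visual thought, and all function names below are hypothetical, chosen only to show why inserting visual information at every fixed step can spend more tokens than inserting it on demand.

```python
from typing import List

# Hypothetical cost of one inserted visual thought, in tokens (illustrative only).
VISUAL_TOKENS_PER_INSERT = 64


def static_positions(num_steps: int, every: int) -> List[int]:
    """Static positioning: insert a visual thought at fixed step intervals."""
    return [i for i in range(num_steps) if i % every == 0]


def dynamic_positions(need_scores: List[float], threshold: float) -> List[int]:
    """Dynamic integration (assumed rule): insert only where the need score is high."""
    return [i for i, score in enumerate(need_scores) if score >= threshold]


def visual_token_cost(positions: List[int]) -> int:
    """Total visual tokens spent for a given set of insertion steps."""
    return len(positions) * VISUAL_TOKENS_PER_INSERT


# Made-up scores for how strongly each reasoning step depends on the image.
need_scores = [0.9, 0.1, 0.2, 0.8, 0.1, 0.05]

static = static_positions(len(need_scores), every=1)
dynamic = dynamic_positions(need_scores, threshold=0.5)

print("static steps:", static, "cost:", visual_token_cost(static))
print("dynamic steps:", dynamic, "cost:", visual_token_cost(dynamic))
```

In this toy setting the dynamic rule inserts visual thoughts at only two of six steps. The numbers are invented and say nothing about the savings the paper reports.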

Taxonomy

Topics: Multimodal Machine Learning Applications · Visual Attention and Saliency Detection · Generative Adversarial Networks and Image Synthesis