Interleaved-Modal Chain-of-Thought

Jun Gao; Yongqi Li; Ziqiang Cao; Wenjie Li

arXiv:2411.19488 · cs.CV · March 18, 2025

TL;DR

This paper introduces Interleaved-modal Chain-of-Thought (ICoT), a multimodal reasoning approach for vision-language models that generates fine-grained visual-textual rationales, improving interpretability and performance.

Contribution

The paper proposes ICoT, a novel multimodal reasoning method with Attention-driven Selection (ADS), enabling VLMs to produce interleaved visual and textual rationales without additional training.
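
As a rough illustration of the idea behind attention-driven selection, the sketch below picks the image patches that receive the most attention and pairs them with each textual reasoning step. It is not the authors' implementation: the summary here does not say which attention scores ADS uses or how many patches it keeps, so the function names `select_patches` and `interleave`, the per-patch score list, and the top-k rule are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def select_patches(attention: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k image patches with the highest attention.

    `attention` holds one score per image patch (hypothetical input; the
    source does not say which layer or head the scores come from).
    Indices are returned in ascending order so the selected patches keep
    their original spatial order.
    """
    if k <= 0:
        return []
    # Rank patch indices by score, highest first; ties keep the lower index.
    ranked = sorted(range(len(attention)), key=lambda i: (-attention[i], i))
    return sorted(ranked[:k])


def interleave(
    text_steps: Sequence[str],
    attention_per_step: Sequence[Sequence[float]],
    k: int,
) -> List[Tuple[str, List[int]]]:
    """Pair each textual rationale with the patch indices selected for it."""
    return [
        (step, select_patches(scores, k))
        for step, scores in zip(text_steps, attention_per_step)
    ]
```

Because the selection only reads scores the model already produces, a scheme of this kind needs no extra training, which matches the training-free claim above. In a real VLM the selected indices would map back to visual tokens of the input image rather than plain integers.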

Findings

1. Achieves up to 14% performance improvement on benchmarks.
2. Enhances interpretability of VLM reasoning processes.
3. Demonstrates generalizability across different VLM architectures.

Abstract

Chain-of-Thought (CoT) prompting elicits a series of intermediate reasoning steps from large language models (LLMs) before they arrive at the final answer. However, when transitioning to vision-language models (VLMs), text-only rationales struggle to express fine-grained associations with the original image. In this paper, we propose an image-incorporated multimodal Chain-of-Thought, named Interleaved-modal Chain-of-Thought (ICoT), which generates sequential reasoning steps consisting of paired visual and textual rationales to infer the final answer. Intuitively, ICoT requires VLMs to generate fine-grained interleaved-modal content, which is hard for current VLMs to do. Considering that the required visual information is usually part of the input image, we propose Attention-driven Selection (ADS) to realize ICoT over existing…

Taxonomy

Topics: Opinion Dynamics and Social Influence

Methods: Softmax · Attention Is All You Need · Chain-of-thought prompting