Image-of-Thought Prompting for Visual Reasoning Refinement in Multimodal Large Language Models

Qiji Zhou; Ruochen Zhou; Zike Hu; Panzhong Lu; Siyang Gao; Yue Zhang

arXiv:2405.13872 · cs.AI · May 30, 2024

Open Access

TL;DR

The paper introduces Image-of-Thought prompting, a novel method enabling multimodal large language models to extract and refine visual rationales step-by-step, significantly enhancing zero-shot visual reasoning and interpretability.

Contribution

The paper proposes the Image-of-Thought (IoT) prompting technique, which automatically designs visual information extraction steps and integrates visual and textual rationales for improved multimodal reasoning.

Findings

01. Improved zero-shot visual reasoning performance across tasks

02. Enhanced interpretability through step-by-step visual explanations

03. Effective in various multimodal large language models

Abstract

Recent advancements in Chain-of-Thought (CoT) and related rationale-based works have significantly improved the performance of Large Language Models (LLMs) in complex reasoning tasks. With the evolution of Multimodal Large Language Models (MLLMs), enhancing their capability to tackle complex multimodal reasoning problems is a crucial frontier. However, incorporating multimodal rationales in CoT has yet to be thoroughly investigated. We propose the Image-of-Thought (IoT) prompting method, which helps MLLMs to extract visual rationales step-by-step. Specifically, IoT prompting can automatically design critical visual information extraction operations based on the input images and questions. Each step of visual information refinement identifies specific visual rationales that support answers to complex visual reasoning questions. Beyond the textual CoT, IoT simultaneously utilizes visual…
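The step-by-step extraction described in the abstract can be pictured with a toy sketch. This is not the authors' implementation: the planner, the operation names (`crop_upper_left`, `threshold`), the `Rationale` record, and the list-of-lists image are all hypothetical stand-ins chosen for illustration. In a real system a multimodal model would presumably choose the operations and write the textual notes.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Image = List[List[int]]  # toy grayscale image: rows of pixel values (assumption)


@dataclass
class Rationale:
    operation: str   # name of the visual operation that was applied
    visual: Image    # refined image produced by that operation
    textual: str     # short text note describing the step


def crop(img: Image, box: Tuple[int, int, int, int]) -> Image:
    """Keep rows top..bottom-1 and columns left..right-1."""
    top, left, bottom, right = box
    return [row[left:right] for row in img[top:bottom]]


def plan_operations(question: str) -> List[Tuple[str, Callable[[Image], Image]]]:
    """Hypothetical planner with a fixed plan; it ignores the question.
    A real system would let the model pick steps from image and question."""
    return [
        ("crop_upper_left", lambda im: crop(im, (0, 0, 2, 2))),
        ("threshold", lambda im: [[1 if p > 127 else 0 for p in row] for row in im]),
    ]


def run_steps(img: Image, question: str) -> List[Rationale]:
    """Apply the planned operations in order, recording one rationale per step."""
    trail: List[Rationale] = []
    current = img
    for name, op in plan_operations(question):
        current = op(current)  # each step refines the previous step's output
        note = f"{name}: {len(current)}x{len(current[0])} region kept"
        trail.append(Rationale(name, current, note))
    return trail


image = [[200, 10, 30], [40, 220, 50], [60, 70, 80]]
trail = run_steps(image, "Is the bright object in the upper left?")
for step in trail:
    print(step.textual, step.visual)
```

Keeping each step's image and note side by side is what makes the trail inspectable: a reader can check which region and which transformation supported the final answer.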

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling