MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action

Zhengyuan Yang; Linjie Li; Jianfeng Wang; Kevin Lin; Ehsan Azarnasab; Faisal Ahmed; Zicheng Liu; Ce Liu; Michael Zeng; Lijuan Wang

arXiv:2303.11381·cs.CV·March 22, 2023·79 cites

TL;DR

MM-REACT is a system that combines ChatGPT with vision experts using a novel prompt design to enable advanced multimodal reasoning and action in zero-shot scenarios.

Contribution

It introduces a new prompt-based system paradigm that allows language models to process and reason over multimodal visual signals without fine-tuning.

Findings

01. Effective zero-shot multimodal reasoning demonstrated

02. Versatile application across different visual understanding scenarios

03. Comparable or superior to fine-tuning approaches

Abstract

We propose MM-REACT, a system paradigm that integrates ChatGPT with a pool of vision experts to achieve multimodal reasoning and action. In this paper, we define and explore a comprehensive list of advanced vision tasks that are intriguing to solve, but may exceed the capabilities of existing vision and vision-language models. To achieve such advanced visual intelligence, MM-REACT introduces a textual prompt design that can represent text descriptions, textualized spatial coordinates, and aligned file names for dense visual signals such as images and videos. MM-REACT's prompt design allows language models to accept, associate, and process multimodal information, thereby facilitating the synergetic combination of ChatGPT and various vision experts. Zero-shot experiments demonstrate MM-REACT's effectiveness in addressing the specified capabilities of interest and its wide application in…
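To make the prompt design described in the abstract more concrete, the following is a minimal sketch of how a vision expert's output might be serialized into plain text: a file name, a caption, and box coordinates written out as numbers. The `Detection` class, the `textualize` function, the file name, and the output layout are all hypothetical illustrations chosen for this example; they are not taken from the paper or from the microsoft/MM-REACT repository.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Detection:
    """One object reported by a hypothetical vision expert."""
    label: str
    box: Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def textualize(file_name: str, caption: str, detections: List[Detection]) -> str:
    """Serialize expert output into plain text a language model can read.

    The layout below is an assumption for illustration only.
    """
    lines = [f"Image file: {file_name}", f"Caption: {caption}"]
    if detections:
        lines.append("Objects (left, top, right, bottom):")
        for det in detections:
            left, top, right, bottom = det.box
            lines.append(f"- {det.label} at ({left}, {top}, {right}, {bottom})")
    return "\n".join(lines)


# Hypothetical inputs standing in for real expert output.
prompt = textualize(
    "photo_001.jpg",
    "a dog chasing a ball on grass",
    [Detection("dog", (34, 50, 210, 300)), Detection("ball", (250, 220, 290, 260))],
)
print(prompt)
```

The resulting string could be appended to a conversation with a language model, which then sees the image only through its file name and the text derived from it. Keeping the file name in the text is what would let later turns refer back to the same image.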

Code & Models

Repositories

microsoft/MM-REACT (Official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques