Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought Reasoning

Hao Shao; Shengju Qian; Han Xiao; Guanglu Song; Zhuofan Zong; Letian Wang; Yu Liu; Hongsheng Li

arXiv:2403.16999 · cs.CV · November 5, 2024 · 5 cites

TL;DR

This paper introduces a large-scale dataset and benchmark for multi-modal language models, emphasizing interpretability and reasoning over complex visual inputs with a focus on local regions.

Contribution

It provides a comprehensive dataset with annotated reasoning steps and key regions, along with a multi-turn processing pipeline for improved interpretability and reasoning in MLLMs.

Findings

01. Enhanced model performance on local region identification tasks

02. Effective multi-turn reasoning improves interpretability

03. New benchmark facilitates evaluation of visual reasoning capabilities

Abstract

Multi-Modal Large Language Models (MLLMs) have demonstrated impressive performance in various VQA tasks. However, they often lack interpretability and struggle with complex visual inputs, especially when the input image has a high resolution or when the region of interest that provides the key information for answering the question is small. To address these challenges, we collect and introduce the large-scale Visual CoT dataset, comprising 438k question-answer pairs annotated with intermediate bounding boxes that highlight the key regions essential for answering the questions. Additionally, about 98k of these pairs are annotated with detailed reasoning steps. Importantly, we propose a multi-turn processing pipeline that dynamically focuses on visual inputs and provides interpretable thoughts. We also introduce the related benchmark to evaluate the MLLMs in scenarios requiring specific…
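
The abstract describes a multi-turn pipeline that first focuses on a key region and then answers the question. The following is a minimal sketch of that general idea, not the authors' implementation: the image is a toy nested list, and `locate` and `answer` are hypothetical callables standing in for the model's two turns. The box format `(x0, y0, x1, y1)` with exclusive upper bounds is also an assumption made for illustration.

```python
from typing import Callable, List, Tuple

Image = List[List[int]]          # toy grayscale image: rows of pixel values
Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1); x1 and y1 are exclusive


def clamp_box(box: Box, width: int, height: int) -> Box:
    """Clip a box to the image bounds and reject empty boxes."""
    x0, y0, x1, y1 = box
    x0, x1 = max(0, min(x0, width)), max(0, min(x1, width))
    y0, y1 = max(0, min(y0, height)), max(0, min(y1, height))
    if x1 <= x0 or y1 <= y0:
        raise ValueError("bounding box is empty after clamping")
    return x0, y0, x1, y1


def crop(image: Image, box: Box) -> Image:
    """Return the sub-image covered by the (clamped) box."""
    height, width = len(image), len(image[0])
    x0, y0, x1, y1 = clamp_box(box, width, height)
    return [row[x0:x1] for row in image[y0:y1]]


def two_turn_answer(
    image: Image,
    question: str,
    locate: Callable[[Image, str], Box],          # hypothetical turn 1
    answer: Callable[[Image, Image, str], object],  # hypothetical turn 2
):
    """Turn 1 proposes a key region; turn 2 answers using image and crop."""
    box = locate(image, question)
    region = crop(image, box)
    # Returning the box alongside the answer keeps the intermediate step visible.
    return answer(image, region, question), box


if __name__ == "__main__":
    image = [[r * 4 + c for c in range(4)] for r in range(4)]
    result, box = two_turn_answer(
        image,
        "What is the sum of the central pixels?",
        locate=lambda img, q: (1, 1, 3, 3),
        answer=lambda img, region, q: sum(sum(row) for row in region),
    )
    print(result, box)
```

Returning the predicted box together with the answer is what makes the intermediate step inspectable, which is the interpretability benefit the abstract points to.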

Code & Models

Repositories

deepcs233/visual-cot (PyTorch, official)

Models

linhuixiao/Awesome-Visual-Grounding

Taxonomy

Topics: Semantic Web and Ontologies · Natural Language Processing Techniques · Topic Modeling