Advancing Large Multi-modal Models with Explicit Chain-of-Reasoning and Visual Question Generation

Kohei Uehara; Nabarun Goswami; Hanqin Wang; Toshiaki Baba; Kohtaro Tanaka; Tomohiro Hashimoto; Kai Wang; Rei Ito; Takagi Naoya; Ryo Umagami; Yingyi Wen; Tanachai Anakewat; Tatsuya Harada

arXiv:2401.10005·cs.CV·July 19, 2024·2 cites

TL;DR

This paper introduces a novel vision-language model that employs explicit chain-of-reasoning and visual question generation, improving interpretability and robustness in visual content understanding.

Contribution

The paper develops a new dataset and training approach enabling VLMs to generate questions and perform iterative reasoning for enhanced interpretability.

Findings

01. Improved reasoning accuracy in VLMs
02. Enhanced robustness through question-asking mechanism
03. Demonstrated effectiveness on diverse visual tasks

Abstract

The increasing demand for intelligent systems capable of interpreting and reasoning about visual content requires the development of large Vision-and-Language Models (VLMs) that are not only accurate but also have explicit reasoning capabilities. This paper presents a novel approach to develop a VLM with the ability to conduct explicit reasoning based on visual content and textual instructions. We introduce a system that can ask a question to acquire necessary knowledge, thereby enhancing the robustness and explicability of the reasoning process. To this end, we developed a novel dataset generated by a Large Language Model (LLM), designed to promote chain-of-thought reasoning combined with a question-asking mechanism. The dataset covers a range of tasks, from common ones like caption generation to specialized VQA tasks that require expert knowledge. Furthermore, using the dataset we…
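
The reasoning-with-questions idea described in the abstract can be sketched as a simple control loop. This is a minimal illustration, not the authors' implementation: `Step`, `run_chain`, `toy_propose`, and `toy_oracle` are hypothetical names, the model is replaced by a hard-coded stub, and the step kinds ("reason", "question", "knowledge", "answer") are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    kind: str  # assumed kinds: "reason", "question", "knowledge", "answer"
    text: str


def run_chain(
    propose: Callable[[List[Step]], Step],
    oracle: Callable[[str], str],
    max_steps: int = 8,
) -> List[Step]:
    """Alternate reasoning steps and questions until an answer is produced.

    `propose` stands in for the model: given the trace so far, it returns
    the next step. `oracle` stands in for whatever knowledge source replies
    to a generated question. Both are hypothetical placeholders.
    """
    trace: List[Step] = []
    for _ in range(max_steps):
        step = propose(trace)
        trace.append(step)
        if step.kind == "question":
            # The reply is appended to the trace so later steps can use it.
            trace.append(Step("knowledge", oracle(step.text)))
        elif step.kind == "answer":
            break
    return trace


def toy_propose(trace: List[Step]) -> Step:
    """Hard-coded stub: reason once, ask once, then answer."""
    kinds = [s.kind for s in trace]
    if not trace:
        return Step("reason", "The image contains an object I cannot name.")
    if "knowledge" not in kinds:
        return Step("question", "What is the object in the image called?")
    return Step("answer", "Final answer using: " + trace[-1].text)


def toy_oracle(question: str) -> str:
    """Hard-coded stub for the knowledge source."""
    return "example-object"
```

Calling `run_chain(toy_propose, toy_oracle)` yields a trace that makes each intermediate step inspectable, which is the kind of explicit reasoning record the abstract refers to. The `max_steps` cap is a design choice of this sketch to guarantee termination.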

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques
