Visual Question Decomposition on Multimodal Large Language Models

Haowei Zhang; Jianzhe Liu; Zhen Han; Shuo Chen; Bailan He; Volker Tresp; Zhiqiang Xu; Jindong Gu

arXiv:2409.19339·cs.CL·October 8, 2024

TL;DR

This paper evaluates and improves the question decomposition ability of Multimodal Large Language Models (MLLMs) on visual questions, introducing an evaluation framework, a finetuning dataset (DecoVQA+), and a finetuning pipeline that improve sub-question quality and accuracy on VQA benchmarks.

Contribution

The paper introduces the DecoVQA+ dataset and a finetuning pipeline to improve MLLMs' visual question decomposition capabilities.

Findings

01. Enhanced sub-question quality after finetuning.

02. Improved accuracy on VQA benchmarks.

03. Effective selective decomposition policy achieved.
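
As a rough illustration of what a selective decomposition policy means in control-flow terms, the sketch below decomposes a visual question only when a decision step says it is worthwhile, and otherwise answers directly. It is not the paper's implementation: the function name and the three callables (`should_decompose`, `decompose`, `answer`) are hypothetical stand-ins for MLLM calls, and how the paper prompts or finetunes the model for each step is not stated here.

```python
from typing import Callable, List, Tuple

SubQA = Tuple[str, str]  # (sub-question, sub-answer)


def answer_with_selective_decomposition(
    image: object,
    question: str,
    should_decompose: Callable[[object, str], bool],
    decompose: Callable[[object, str], List[str]],
    answer: Callable[[object, str, List[SubQA]], str],
) -> Tuple[str, List[SubQA]]:
    """Return (final_answer, sub_qa_pairs).

    All three callables are hypothetical placeholders for model calls;
    they are assumptions made for this sketch, not an API from the paper.
    """
    sub_qa: List[SubQA] = []
    # Selective step: only decompose when the policy judges it helpful.
    if should_decompose(image, question):
        for sub_question in decompose(image, question):
            # Each sub-question is answered on its own, without extra context.
            sub_qa.append((sub_question, answer(image, sub_question, [])))
    # The final answer sees the collected sub-answers (empty if no decomposition).
    return answer(image, question, sub_qa), sub_qa
```

Passing the model calls in as plain callables keeps the sketch independent of any particular MLLM library; in practice each one would wrap a prompt to the same or a different model.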

Abstract

Question decomposition has emerged as an effective strategy for prompting Large Language Models (LLMs) to answer complex questions. However, while existing methods primarily focus on unimodal language models, the question decomposition capability of Multimodal Large Language Models (MLLMs) has yet to be explored. To this end, this paper explores visual question decomposition on MLLMs. Specifically, we introduce a systematic evaluation framework including a dataset and several evaluation criteria to assess the quality of the decomposed sub-questions, revealing that existing MLLMs struggle to produce high-quality sub-questions. To address this limitation, we propose a specific finetuning dataset, DecoVQA+, for enhancing the model's question decomposition capability. Aiming at enabling models to perform appropriate selective decomposition, we propose an efficient finetuning pipeline. The…

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization
