Visual Question Decomposition on Multimodal Large Language Models
Haowei Zhang, Jianzhe Liu, Zhen Han, Shuo Chen, Bailan He, Volker Tresp, Zhiqiang Xu, Jindong Gu

TL;DR
This paper evaluates and improves the question decomposition ability of Multimodal Large Language Models (MLLMs) on visual questions. It introduces an evaluation framework, a finetuning dataset (DecoVQA+), and a finetuning pipeline that together improve sub-question quality and accuracy on VQA benchmarks.
Contribution
The paper introduces the DecoVQA+ dataset and a finetuning pipeline that significantly improve MLLMs' visual question decomposition capabilities.
Findings
Finetuning improves the quality of the generated sub-questions.
Accuracy improves on VQA benchmarks.
Finetuned models decompose selectively, only when decomposition is appropriate.
Abstract
Question decomposition has emerged as an effective strategy for prompting Large Language Models (LLMs) to answer complex questions. However, while existing methods primarily focus on unimodal language models, the question decomposition capability of Multimodal Large Language Models (MLLMs) has yet to be explored. To this end, this paper explores visual question decomposition on MLLMs. Specifically, we introduce a systematic evaluation framework including a dataset and several evaluation criteria to assess the quality of the decomposed sub-questions, revealing that existing MLLMs struggle to produce high-quality sub-questions. To address this limitation, we propose a specific finetuning dataset, DecoVQA+, for enhancing the model's question decomposition capability. Aiming at enabling models to perform appropriate selective decomposition, we propose an efficient finetuning pipeline. The…
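The idea of selective decomposition described in the abstract can be illustrated with a minimal Python sketch. This is not the paper's pipeline: the function name and the three callables (`should_decompose`, `decompose`, `answer`) are hypothetical placeholders for whatever a model would provide, and the image is assumed to be captured inside those callables rather than passed explicitly.

```python
from typing import Callable, List, Tuple

# One (sub-question, sub-answer) pair gathered before answering the main question.
QAPair = Tuple[str, str]


def answer_with_selective_decomposition(
    question: str,
    should_decompose: Callable[[str], bool],      # hypothetical: does this question need decomposition?
    decompose: Callable[[str], List[str]],        # hypothetical: produce sub-questions
    answer: Callable[[str, List[QAPair]], str],   # hypothetical: answer given prior Q/A context
) -> Tuple[str, List[QAPair]]:
    """Answer a question, decomposing it into sub-questions only when asked to."""
    context: List[QAPair] = []
    if should_decompose(question):
        for sub_question in decompose(question):
            # Each sub-question is answered on its own, without prior context.
            context.append((sub_question, answer(sub_question, [])))
    # The main question is answered with whatever context was collected (possibly none).
    return answer(question, context), context
```

In this sketch a simple question skips decomposition and is answered directly, while a complex one is answered with its sub-question answers as context. In practice all three callables could be backed by the same model.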
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization
