Disentangling Knowledge-based and Visual Reasoning by Question Decomposition in KB-VQA
Elham J. Barezi, Parisa Kordjamshidi

TL;DR
This paper improves knowledge-based visual question answering by decomposing complex questions into simpler ones, which improves information extraction and reasoning and yields accuracy gains of up to 2% on multiple datasets.
Contribution
It introduces a question decomposition approach that separates visual and non-visual reasoning, improving multi-hop question answering in KB-VQA tasks.
Findings
Decomposing questions improves accuracy on VQA datasets.
Using simpler questions enhances visual and knowledge reasoning.
Up to 2% accuracy improvement demonstrated.
Abstract
We study the knowledge-based visual question answering (KB-VQA) problem, in which a model must ground a given question in the visual modality to find the answer. Although many recent works use question-dependent captioners to verbalize the given image and Large Language Models to solve the VQA problem, research results show that they do not perform well on multi-hop questions. Our study shows that replacing a complex question with several simpler questions helps to extract more relevant information from the image and provides a stronger comprehension of it. Moreover, we analyze the decomposed questions to determine the modality of the information required to answer them, and use a captioner for the visual questions and LLMs as a general knowledge source for the non-visual KB-based questions. Our results demonstrate the positive impact of using simple questions…
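The routing idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: every name below (`answer_by_decomposition`, `decompose`, `is_visual`, `ask_captioner`, `ask_llm`) is hypothetical, and the toy stubs stand in for a real question decomposer, modality classifier, captioner, and LLM. It only shows the control flow: split the question, send visual sub-questions to the captioner and non-visual ones to the LLM, then answer the original question from the collected facts.

```python
from typing import Callable, List, Tuple

# One trace entry per sub-question: (sub-question, source, answer).
Trace = List[Tuple[str, str, str]]


def answer_by_decomposition(
    question: str,
    decompose: Callable[[str], List[str]],       # hypothetical: complex question -> simpler ones
    is_visual: Callable[[str], bool],            # hypothetical: modality check per sub-question
    ask_captioner: Callable[[str], str],         # hypothetical: question-dependent captioner
    ask_llm: Callable[[str, List[str]], str],    # hypothetical: LLM used as a knowledge source
) -> Tuple[str, Trace]:
    """Route each sub-question by modality, then answer the original question."""
    trace: Trace = []
    context: List[str] = []
    for sub in decompose(question):
        if is_visual(sub):
            source, answer = "captioner", ask_captioner(sub)
        else:
            source, answer = "llm", ask_llm(sub, context)
        trace.append((sub, source, answer))
        context.append(f"{sub} {answer}")
    # The final answer is produced from the accumulated sub-answers.
    return ask_llm(question, context), trace


# Toy stand-ins (assumptions for illustration only, not real models).
def toy_decompose(question: str) -> List[str]:
    return ["What animal is in the image?", "What does this animal eat?"]


def toy_is_visual(sub: str) -> bool:
    return "image" in sub.lower()


def toy_captioner(sub: str) -> str:
    return "a giraffe"


def toy_llm(question: str, context: List[str]) -> str:
    return f"[LLM answer to '{question}' given {len(context)} fact(s)]"


final, trace = answer_by_decomposition(
    "What does the animal in the image eat?",
    toy_decompose, toy_is_visual, toy_captioner, toy_llm,
)
for sub, source, answer in trace:
    print(f"{source:9s} | {sub} -> {answer}")
print(final)
```

Passing the components in as callables keeps the sketch independent of any particular captioner or LLM; in practice the keyword-based `toy_is_visual` would be replaced by whatever modality analysis is actually used.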
Taxonomy
Topics: AI-based Problem Solving and Planning · Rough Sets and Fuzzy Logic · Bayesian Modeling and Causal Inference
