Find The Gap: Knowledge Base Reasoning For Visual Question Answering
Elham J. Barezi, Parisa Kordjamshidi

TL;DR
This paper investigates how neural architectures and large language models can be used for knowledge-based visual question answering, emphasizing the roles of explicit knowledge retrieval and the limitations of LLMs in multi-hop reasoning.
Contribution
It provides a comparative analysis of task-specific neural models and LLMs for KB-VQA, highlighting the benefits of explicit knowledge retrieval and the limitations of LLMs in multi-hop reasoning.
Findings
LLMs excel at 1-hop reasoning but struggle with 2-hop reasoning.
Explicit knowledge retrieval improves model performance.
LLMs outperform task-specific neural models on KB-related questions, but still rely on an external KB.
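To make the 1-hop versus 2-hop distinction concrete, here is a minimal sketch in Python. It assumes that a "hop" means one relation lookup in the KB after the question has been grounded in the image; the paper's exact definition may differ. The toy KB, the entity names, and the helpers `ground` and `answer` are all hypothetical and are not taken from the paper.

```python
# Toy knowledge base of (subject, relation) -> object facts.
# Every entity and relation here is invented for illustration only.
TOY_KB = {
    ("person_A", "brother"): "person_B",
    ("person_A", "profession"): "teacher",
    ("person_B", "profession"): "doctor",
}


def ground(image_entities, mention):
    """Map a question mention (e.g. 'left') to an entity detected in the image."""
    return image_entities[mention]


def answer(image_entities, mention, relations):
    """Ground the mention, then follow one KB lookup per relation (one hop each)."""
    entity = ground(image_entities, mention)
    for relation in relations:
        entity = TOY_KB[(entity, relation)]
    return entity


# Hypothetical output of a visual grounding step: image position -> KB entity.
image_entities = {"left": "person_A", "right": "person_B"}

# 1-hop: "What is the profession of the person on the left?"
one_hop = answer(image_entities, "left", ["profession"])

# 2-hop: "What is the profession of the brother of the person on the left?"
two_hop = answer(image_entities, "left", ["brother", "profession"])
```

In this sketch the 2-hop question needs an intermediate entity (`person_B`) that appears in neither the question nor the final answer, which is one plausible reason chained lookups are harder than single ones.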
Abstract
We analyze knowledge-based visual question answering (KB-VQA), in which, given a question, a model must ground it in the visual modality and retrieve the relevant knowledge from a given large knowledge base (KB) in order to answer. Our analysis is twofold: one part designs neural architectures and trains them from scratch, and the other builds on large pre-trained language models (LLMs). Our research questions are: 1) Can we effectively augment models by explicit supervised retrieval of the relevant KB information to solve the KB-VQA problem? 2) How do task-specific and LLM-based models perform in integrating visual and external knowledge, and in multi-hop reasoning over both sources of information? 3) Is the implicit knowledge of LLMs sufficient for KB-VQA, and to what extent can it replace the explicit KB? Our results demonstrate the positive impact of empowering…
Taxonomy
Topics: Semantic Web and Ontologies · Multimodal Machine Learning Applications · Speech and dialogue systems
Methods: Balanced Selection
