Find The Gap: Knowledge Base Reasoning For Visual Question Answering

Elham J. Barezi; Parisa Kordjamshidi

arXiv:2404.10226 · cs.AI · April 17, 2024 · 1 citation

TL;DR

This paper investigates how neural architectures and large language models can be used for knowledge-based visual question answering, emphasizing the roles of explicit knowledge retrieval and the limitations of LLMs in multi-hop reasoning.

Contribution

The paper compares task-specific neural models with LLMs on KB-VQA, showing where explicit knowledge retrieval helps and where LLMs fall short in multi-hop reasoning.

Findings

01. LLMs excel at 1-hop reasoning but struggle with 2-hop reasoning.

02. Explicit knowledge retrieval improves model performance.

03. LLMs outperform task-specific neural models on KB-related questions but still rely on an external KB.

Abstract

We analyze knowledge-based visual question answering (KB-VQA), in which, given a question, a model must ground it in the visual modality and retrieve the relevant knowledge from a large knowledge base (KB) in order to answer. Our analysis is twofold: one part designs neural architectures and trains them from scratch, and the other builds on large pre-trained language models (LLMs). Our research questions are: 1) Can we effectively augment models with explicit supervised retrieval of the relevant KB information to solve the KB-VQA problem? 2) How do task-specific and LLM-based models perform in integrating visual and external knowledge, and in multi-hop reasoning over both sources of information? 3) Is the implicit knowledge of LLMs sufficient for KB-VQA, and to what extent can it replace the explicit KB? Our results demonstrate the positive impact of empowering…
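To make the distinction between 1-hop and 2-hop reasoning over an explicit KB concrete, here is a minimal Python sketch. It is not the authors' architecture: the triples, the `build_index` and `answer` helpers, and the relation names are all hypothetical, and the visually grounded entity ("zebra") is assumed to have already been produced by some image model.

```python
from collections import defaultdict

# Hypothetical toy knowledge base of (subject, relation, object) triples.
# These facts are illustrative only and do not come from the paper's KB.
TRIPLES = [
    ("zebra", "is_a", "mammal"),
    ("zebra", "lives_in", "savanna"),
    ("savanna", "located_in", "Africa"),
    ("mammal", "has", "fur"),
]


def build_index(triples):
    """Index triples by (subject, relation) for explicit retrieval."""
    index = defaultdict(list)
    for subject, relation, obj in triples:
        index[(subject, relation)].append(obj)
    return index


def answer(index, entity, relations):
    """Follow a chain of relations starting from a grounded entity.

    One relation is a 1-hop query; two relations form a 2-hop query,
    where the result of the first hop becomes the subject of the second.
    """
    frontier = [entity]
    for relation in relations:
        next_frontier = []
        for node in frontier:
            next_frontier.extend(index.get((node, relation), []))
        frontier = next_frontier
    return frontier


INDEX = build_index(TRIPLES)

# "Where does the animal in the image live?" (1 hop)
one_hop = answer(INDEX, "zebra", ["lives_in"])

# "On which continent is the animal's habitat?" (2 hops)
two_hop = answer(INDEX, "zebra", ["lives_in", "located_in"])

print(one_hop, two_hop)
```

In this toy setting the 2-hop query succeeds only if the intermediate fact ("savanna") is retrieved correctly, which is the kind of dependency that makes multi-hop questions harder than single-hop ones.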

Taxonomy

Topics: Semantic Web and Ontologies · Multimodal Machine Learning Applications · Speech and dialogue systems

Methods: Balanced Selection