GC-KBVQA: A New Four-Stage Framework for Enhancing Knowledge Based Visual Question Answering Performance
Mohammad Mahdi Moradi, Sudhir Mudur

TL;DR
This paper introduces GC-KBVQA, a four-stage framework that enhances knowledge-based visual question answering by grounding question-aware captions and integrating external knowledge, enabling effective zero-shot performance without task-specific training.
Contribution
The paper proposes a novel four-stage framework that improves KB-VQA by grounding captions and leveraging external knowledge, eliminating the need for end-to-end multimodal training.
Findings
Significantly improved performance over existing KB-VQA methods.
Effective zero-shot VQA without task-specific fine-tuning.
Reduces cost and complexity by relying on pre-trained LLMs instead of end-to-end multimodal training.
Abstract
Knowledge-Based Visual Question Answering (KB-VQA) methods focus on tasks that demand reasoning with information extending beyond the explicit content depicted in the image. Early methods relied on explicit knowledge bases to provide this auxiliary information. Recent approaches leverage Large Language Models (LLMs) as implicit knowledge sources. While KB-VQA methods have demonstrated promising results, their potential remains constrained as the auxiliary text provided may not be relevant to the question context, and may also include irrelevant information that could misguide the answer predictor. We introduce a novel four-stage framework called Grounding Caption-Guided Knowledge-Based Visual Question Answering (GC-KBVQA), which enables LLMs to effectively perform zero-shot VQA tasks without the need for end-to-end multimodal training. Innovations include grounding question-aware…
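The relevance problem the abstract describes (auxiliary text that does not match the question context can misguide the answer predictor) can be illustrated with a small sketch. This is not the GC-KBVQA pipeline, whose stages are not detailed here. It is a toy stand-in that assumes captions and knowledge snippets are already available as strings, scores captions by simple token overlap with the question, and assembles a zero-shot text prompt for an LLM. All names (`relevance`, `filter_captions`, `build_prompt`), the stopword list, and the threshold are hypothetical.

```python
import re

# Hypothetical stopword list, chosen only for this illustration.
STOPWORDS = {"a", "an", "the", "is", "are", "of", "in", "on",
             "what", "which", "this", "that", "to", "and"}


def tokens(text):
    """Return the set of lowercase word tokens, minus stopwords."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower())
            if w not in STOPWORDS}


def relevance(question, caption):
    """Fraction of question tokens that also occur in the caption."""
    q = tokens(question)
    if not q:
        return 0.0
    return len(q & tokens(caption)) / len(q)


def filter_captions(question, captions, threshold=0.2):
    """Keep only captions whose overlap with the question meets the threshold."""
    return [c for c in captions if relevance(question, c) >= threshold]


def build_prompt(question, captions, knowledge):
    """Assemble a plain-text zero-shot prompt from the filtered context."""
    lines = ["Context:"]
    lines += [f"- {c}" for c in captions]
    lines.append("Knowledge:")
    lines += [f"- {k}" for k in knowledge]
    lines.append(f"Question: {question}")
    lines.append("Answer:")
    return "\n".join(lines)


question = "What sport is the man playing?"
captions = [
    "A man playing tennis on a clay court",
    "A blue sky with clouds",
    "Spectators in the stands",
]
knowledge = ["Tennis is a racket sport played on a rectangular court."]

kept = filter_captions(question, captions)
prompt = build_prompt(question, kept, knowledge)
print(prompt)
```

In this toy example only the first caption survives the filter, so the prompt passed to the LLM carries no unrelated scene description. A real system would likely replace token overlap with a learned relevance model; the overlap score is used here only to keep the sketch self-contained.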
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Text and Document Classification Technologies
Methods: Focus
