GC-KBVQA: A New Four-Stage Framework for Enhancing Knowledge Based Visual Question Answering Performance

Mohammad Mahdi Moradi; Sudhir Mudur

arXiv:2505.19354 · cs.CL · May 27, 2025

Open Access

TL;DR

This paper introduces GC-KBVQA, a four-stage framework that enhances knowledge-based visual question answering by grounding question-aware captions and integrating external knowledge, enabling effective zero-shot performance without task-specific training.

Contribution

The paper proposes a novel four-stage framework that improves KB-VQA by grounding captions and leveraging external knowledge, eliminating the need for end-to-end multimodal training.

Findings

01. Significantly improved performance over existing KB-VQA methods.

02. Effective zero-shot VQA without task-specific fine-tuning.

03. Reduces costs and complexity by using pre-trained LLMs.

Abstract

Knowledge-Based Visual Question Answering (KB-VQA) methods focus on tasks that demand reasoning with information extending beyond the explicit content depicted in the image. Early methods relied on explicit knowledge bases to provide this auxiliary information. Recent approaches leverage Large Language Models (LLMs) as implicit knowledge sources. While KB-VQA methods have demonstrated promising results, their potential remains constrained as the auxiliary text provided may not be relevant to the question context, and may also include irrelevant information that could misguide the answer predictor. We introduce a novel four-stage framework called Grounding Caption-Guided Knowledge-Based Visual Question Answering (GC-KBVQA), which enables LLMs to effectively perform zero-shot VQA tasks without the need for end-to-end multimodal training. Innovations include grounding question-aware…
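The concern raised in the abstract, that auxiliary text unrelated to the question can misguide the answer predictor, can be illustrated with a small sketch. This is not the authors' implementation and does not correspond to the four stages of GC-KBVQA, which the text above does not spell out. It is a minimal Python example, assuming captions are already available as strings, that keeps only captions sharing content words with the question and assembles a text-only prompt for an LLM. The function names (`content_words`, `select_relevant_captions`, `build_prompt`), the stopword list, and the word-overlap heuristic are all hypothetical choices made for illustration.

```python
import re

# Small illustrative stopword list (an assumption, not taken from the paper).
STOPWORDS = {
    "a", "an", "the", "is", "are", "what", "which", "of", "in",
    "on", "this", "that", "to", "for", "and", "with",
}


def content_words(text: str) -> set[str]:
    """Return lowercase word tokens with stopwords removed."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {t for t in tokens if t not in STOPWORDS}


def select_relevant_captions(
    question: str, captions: list[str], top_k: int = 2
) -> list[str]:
    """Keep up to top_k captions that share at least one content word
    with the question, ordered by overlap size (largest first)."""
    q_words = content_words(question)
    scored = [(len(q_words & content_words(c)), c) for c in captions]
    kept = [pair for pair in scored if pair[0] > 0]
    # sorted() is stable, so ties keep their original caption order.
    ranked = sorted(kept, key=lambda pair: pair[0], reverse=True)
    return [caption for _, caption in ranked[:top_k]]


def build_prompt(question: str, captions: list[str]) -> str:
    """Assemble a text-only prompt that a language model could answer."""
    context = "\n".join(f"- {c}" for c in captions)
    return f"Context:\n{context}\nQuestion: {question}\nShort answer:"


example_captions = [
    "A man holding a racket on a court.",
    "A blue sky with clouds.",
    "A crowd watching the man play.",
]
example_question = "What sport is the man with the racket playing?"
example_prompt = build_prompt(
    example_question,
    select_relevant_captions(example_question, example_captions),
)
```

In this toy example the caption about the sky shares no content word with the question and is dropped before the prompt is built. A real system would likely replace the word-overlap score with a learned relevance measure; the sketch only shows where such a filter sits relative to the prompt.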

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Text and Document Classification Technologies