Precision Empowers, Excess Distracts: Visual Question Answering With Dynamically Infused Knowledge In Language Models

Manas Jhalani; Annervaz K M; Pushpak Bhattacharyya

arXiv:2406.09994 · cs.CL · June 17, 2024 · 2 cites

Open Access

TL;DR

This paper improves Knowledge-Based Visual Question Answering by dynamically infusing a language model with external knowledge drawn from knowledge graphs, reporting an average 4.75% gain in Exact Match Score over the state of the art on three KBVQA datasets.

Contribution

Introduces a dynamic triple extraction method to incorporate relevant external knowledge into vision-language transformers for KBVQA, outperforming state-of-the-art models.

Findings

01. 4.75% average improvement in Exact Match Score over the state of the art on three KBVQA datasets

02. Supplying a variable number of triples improves reasoning compared with a fixed number of triples

03. State-of-the-art performance on small datasets with fine-tuning
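The first finding is stated in terms of Exact Match Score. As a rough illustration of how such a score is commonly computed, here is a minimal sketch. The normalization steps (lowercasing, stripping punctuation and English articles) and the single-gold-answer setup are assumptions for illustration; the paper's exact evaluation protocol is not given on this page, and the function names are hypothetical.

```python
import re
import string
from typing import List


def normalize(answer: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace.

    This normalization is an assumption, not the paper's stated protocol.
    """
    answer = answer.lower()
    answer = answer.translate(str.maketrans("", "", string.punctuation))
    answer = re.sub(r"\b(a|an|the)\b", " ", answer)
    return " ".join(answer.split())


def exact_match_score(predictions: List[str], references: List[str]) -> float:
    """Percentage of predictions equal to their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return 100.0 * hits / len(predictions)
```

Under this definition, a 4.75% average improvement would correspond to 4.75 more exactly matched answers per 100 questions, averaged over the datasets.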

Abstract

In the realm of multimodal tasks, Visual Question Answering (VQA) plays a crucial role by addressing natural language questions grounded in visual content. Knowledge-Based Visual Question Answering (KBVQA) advances this concept by adding external knowledge along with images to respond to questions. We introduce an approach for KBVQA, augmenting the existing vision-language transformer encoder-decoder (OFA) model. Our main contribution involves enhancing questions by incorporating relevant external knowledge extracted from knowledge graphs, using a dynamic triple extraction method. We supply a flexible number of triples from the knowledge graph as context, tailored to meet the requirements for answering the question. Our model, enriched with knowledge, demonstrates an average improvement of 4.75% in Exact Match Score over the state-of-the-art on three different KBVQA datasets. Through…
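The abstract describes supplying a flexible number of knowledge-graph triples as context for each question. The sketch below illustrates that general idea only: it keeps every candidate triple whose relevance passes a threshold, so the number of triples varies per question instead of being a fixed top-k. The token-overlap scorer, the `threshold` and `max_triples` parameters, the `context:` separator, and all function names are assumptions made for illustration and are not the authors' extraction method.

```python
import re
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)


def tokens(text: str) -> Set[str]:
    """Lowercased word tokens of a string."""
    return set(re.findall(r"\w+", text.lower()))


def relevance(question: str, triple: Triple) -> float:
    """Fraction of the triple's tokens that also occur in the question.

    A stand-in scorer (assumption); a real system would likely use a
    learned retriever or entity linking instead.
    """
    triple_tokens = tokens(" ".join(triple))
    if not triple_tokens:
        return 0.0
    return len(tokens(question) & triple_tokens) / len(triple_tokens)


def select_triples(question: str, candidates: List[Triple],
                   threshold: float = 0.3, max_triples: int = 5) -> List[Triple]:
    """Keep a variable number of triples: all that score at or above the threshold."""
    scored = [(relevance(question, t), t) for t in candidates]
    kept = [(s, t) for s, t in scored if s >= threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)  # most relevant first
    return [t for _, t in kept[:max_triples]]


def build_input(question: str, triples: List[Triple]) -> str:
    """Append the selected triples to the question as plain-text context."""
    if not triples:
        return question
    context = " ; ".join(" ".join(t) for t in triples)
    return f"{question} context: {context}"
```

A question with no sufficiently relevant triples is passed through unchanged, which reflects the title's point that excess context can distract: irrelevant knowledge is left out instead of padding the input to a fixed size.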

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques