Improving and Diagnosing Knowledge-Based Visual Question Answering via Entity Enhanced Knowledge Injection
Diego Garcia-Olano, Yasumasa Onoe, Joydeep Ghosh

TL;DR
This paper explores how injecting entity-enhanced knowledge graph embeddings into a bi-modal VQA system improves performance on KBVQA tasks, especially with rare entities, without extra pre-training.
Contribution
It demonstrates effective knowledge injection methods in a bi-modal setting, analyzing their impact on VQA performance and interpretability, using publicly available datasets.
Findings
Significant performance gains on KBVQA datasets
Knowledge injection benefits rare entity questions
Improved explanations with entity-enhanced representations
Abstract
Knowledge-Based Visual Question Answering (KBVQA) is a bi-modal task requiring external world knowledge in order to correctly answer a text question and associated image. Recent single modality text work has shown knowledge injection into pre-trained language models, specifically entity enhanced knowledge graph embeddings, can improve performance on downstream entity-centric tasks. In this work, we empirically study how and whether such methods, applied in a bi-modal setting, can improve an existing VQA system's performance on the KBVQA task. We experiment with two large publicly available VQA datasets, (1) KVQA which contains mostly rare Wikipedia entities and (2) OKVQA which is less entity-centric and more aligned with common sense reasoning. Both lack explicit entity spans and we study the effect of different weakly supervised and manual methods for obtaining them. Additionally we…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning
