Entity-Focused Dense Passage Retrieval for Outside-Knowledge Visual Question Answering
Jialin Wu, Raymond J. Mooney

TL;DR
This paper introduces EnFoRe, an entity-focused retrieval model that enhances knowledge retrieval for outside-knowledge visual question answering, leading to improved accuracy and state-of-the-art results.
Contribution
The paper proposes a novel entity-focused retrieval approach with stronger supervision, improving knowledge relevance in OK-VQA systems.
Findings
EnFoRe achieves superior retrieval performance on OK-VQA.
Combining EnFoRe with VQA models yields new state-of-the-art results.
Entity recognition improves knowledge specificity and answer accuracy.
Abstract
Most Outside-Knowledge Visual Question Answering (OK-VQA) systems employ a two-stage framework that first retrieves external knowledge given the visual question and then predicts the answer based on the retrieved content. However, the retrieved knowledge is often inadequate. Retrievals are frequently too general and fail to cover specific knowledge needed to answer the question. Also, the naturally available supervision (whether the passage contains the correct answer) is weak and does not guarantee question relevancy. To address these issues, we propose an Entity-Focused Retrieval (EnFoRe) model that provides stronger supervision during training and recognizes question-relevant entities to help retrieve more specific knowledge. Experiments show that our EnFoRe model achieves superior retrieval performance on OK-VQA, the currently largest outside-knowledge VQA dataset. We also combine…
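The weak supervision signal described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names (`weak_label`, `retrieve`), the toy passages, and the two-dimensional vectors are all hypothetical, and a plain inner-product scorer stands in for a trained dense retriever. It shows how a passage can be labeled positive just because it contains the answer string, even when it says nothing relevant to the question.

```python
from typing import List, Sequence


def weak_label(passage: str, answers: Sequence[str]) -> bool:
    """Weak supervision: True if the passage contains any gold answer string."""
    text = passage.lower()
    return any(ans.lower() in text for ans in answers)


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def retrieve(query_vec: Sequence[float],
             passage_vecs: Sequence[Sequence[float]],
             k: int = 2) -> List[int]:
    """Return indices of the k passages with the highest inner-product score."""
    scores = [dot(query_vec, p) for p in passage_vecs]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Hypothetical visual question: "What nutrient is this fruit rich in?" (image: a banana)
passages = [
    "Bananas are rich in potassium.",                   # relevant, contains answer
    "Potassium is a chemical element with symbol K.",   # contains answer, off-topic
    "Monkeys live in tropical forests.",                # no answer
]
answers = ["potassium"]

# Both of the first two passages count as positives under the weak label.
labels = [weak_label(p, answers) for p in passages]

# Toy embeddings (assumed values, not produced by any real encoder).
query_vec = [1.0, 0.2]
passage_vecs = [[0.9, 0.1], [0.1, 0.9], [0.0, 0.1]]
top = retrieve(query_vec, passage_vecs, k=2)
```

Here the second passage receives a positive label although it does not help answer a question about the fruit, which is the relevance gap the abstract attributes to answer-containment supervision.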
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
