A Symmetric Dual Encoding Dense Retrieval Framework for Knowledge-Intensive Visual Question Answering
Alireza Salemi, Juan Altmayer Pizzorno, Hamed Zamani

TL;DR
This paper introduces DEDR, a symmetric dual encoding dense retrieval framework for knowledge-intensive visual question answering. Combined with iterative knowledge distillation between its two encoders and a multi-modal fusion-in-decoder reader (MM-FiD), it improves retrieval and answer generation accuracy on OK-VQA and FVQA.
Contribution
The paper proposes DEDR, a novel symmetric dual encoding dense retrieval method with iterative knowledge distillation, and MM-FiD, a multi-modal fusion model for improved KI-VQA performance.
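To make the retrieval side concrete, the following is a minimal sketch of scoring documents against a query when each item is embedded by two encoders into a shared space. It is not the authors' implementation: the random vectors stand in for the outputs of a textual and a multi-modal encoder, and combining the two embeddings by concatenation followed by L2 normalization is an assumption made only for illustration. The names `combine` and `retrieve` are hypothetical.

```python
import numpy as np

def combine(text_emb, mm_emb):
    # Assumption: join the two encoder outputs by concatenation, then
    # L2-normalize so the dot product equals cosine similarity.
    joint = np.concatenate([text_emb, mm_emb], axis=-1)
    return joint / np.linalg.norm(joint, axis=-1, keepdims=True)

def retrieve(query_vec, doc_matrix, k):
    # Dense retrieval: score every document by inner product with the query
    # and return the indices and scores of the k best matches.
    scores = doc_matrix @ query_vec
    top = np.argsort(-scores)[:k]
    return top, scores[top]

# Stand-ins for encoder outputs (hypothetical data, 5 documents, 8 dims each).
rng = np.random.default_rng(0)
doc_text = rng.normal(size=(5, 8))   # textual encoder outputs
doc_mm = rng.normal(size=(5, 8))     # multi-modal encoder outputs
docs = combine(doc_text, doc_mm)

# A query whose two embeddings coincide with those of document 3.
query = combine(doc_text[3], doc_mm[3])
top_ids, top_scores = retrieve(query, docs, k=2)
```

Because both sides are encoded the same way and normalized, a query that matches a document exactly receives the maximum possible score, which is what places document 3 first here.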
Findings
DEDR outperforms state-of-the-art retrieval baselines by 11.6% on OK-VQA and 30.9% on FVQA.
MM-FiD improves question answering accuracy by 5.5% on OK-VQA and 8.5% on FVQA.
The approach effectively bridges representation gaps between encoders, enhancing retrieval and answer generation.
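One common way to bring two encoders' representation spaces closer is to train one (the student) to match the other's (the teacher's) score distribution over candidate passages. The sketch below shows such a loss as a KL divergence between softmax-normalized retrieval scores. The summary above does not specify the loss DEDR uses, so this form, the `temperature` parameter, and the alternating teacher/student schedule are assumptions made for illustration.

```python
import numpy as np

def softmax(scores, temperature=1.0):
    # Numerically stable softmax over the last axis (one row per query).
    z = scores / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_scores, teacher_scores, temperature=1.0):
    # KL(teacher || student), averaged over queries. It is zero when both
    # encoders rank and weight the candidate passages identically.
    p_t = softmax(teacher_scores, temperature)
    p_s = softmax(student_scores, temperature)
    return float(np.sum(p_t * (np.log(p_t) - np.log(p_s)), axis=-1).mean())

# Hypothetical scores for 2 queries over 3 candidate passages each.
teacher = np.array([[2.0, 0.5, -1.0], [0.1, 1.2, 0.3]])
student = np.array([[1.0, 1.0, 0.0], [0.5, 0.2, 0.9]])

loss_same = distillation_loss(teacher, teacher)
loss_diff = distillation_loss(student, teacher)

# Assumed iterative schedule: the two encoders swap teacher and student roles.
schedule = [("text", "multimodal"), ("multimodal", "text")] * 2
```

Alternating the roles, as in `schedule`, is one way to read "iterative": each encoder is pulled toward the other in turn rather than one being fixed as the reference throughout.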
Abstract
Knowledge-Intensive Visual Question Answering (KI-VQA) refers to answering a question about an image whose answer does not lie in the image. This paper presents a new pipeline for KI-VQA tasks, consisting of a retriever and a reader. First, we introduce DEDR, a symmetric dual encoding dense retrieval framework in which documents and queries are encoded into a shared embedding space using uni-modal (textual) and multi-modal encoders. We introduce an iterative knowledge distillation approach that bridges the gap between the representation spaces in these two encoders. Extensive evaluation on two well-established KI-VQA datasets, i.e., OK-VQA and FVQA, suggests that DEDR outperforms state-of-the-art baselines by 11.6% and 30.9% on OK-VQA and FVQA, respectively. Utilizing the passages retrieved by DEDR, we further introduce MM-FiD, an encoder-decoder multi-modal fusion-in-decoder model, for…
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
Methods: Knowledge Distillation
