A Symmetric Dual Encoding Dense Retrieval Framework for Knowledge-Intensive Visual Question Answering

Alireza Salemi; Juan Altmayer Pizzorno; Hamed Zamani

arXiv:2304.13649·cs.CV·April 27, 2023·1 cites

TL;DR

This paper introduces DEDR, a symmetric dual encoding dense retrieval framework for knowledge-intensive visual question answering (KI-VQA), together with the MM-FiD reader. It reports improved retrieval and answer generation accuracy on the OK-VQA and FVQA datasets.

Contribution

The paper proposes DEDR, a symmetric dual encoding dense retrieval method trained with iterative knowledge distillation, and MM-FiD, an encoder-decoder multi-modal fusion-in-decoder model that uses the passages retrieved by DEDR for KI-VQA.
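To make the distillation idea concrete, here is a minimal sketch of a generic knowledge distillation loss between two retrievers. It assumes that a teacher encoder and a student encoder each score the same candidate passages, and that the student is trained to match the teacher's softmax distribution under a KL divergence. The loss form, the function names, and the temperature parameter are illustrative assumptions; this page does not state the loss DEDR actually uses.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Turn raw retrieval scores into a probability distribution."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for two distributions over the same candidates."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def distillation_loss(teacher_scores: Sequence[float],
                      student_scores: Sequence[float],
                      temperature: float = 1.0) -> float:
    """Hypothetical loss: how far the student's ranking distribution is
    from the teacher's (an assumption, not the paper's stated objective)."""
    teacher_dist = softmax(teacher_scores, temperature)
    student_dist = softmax(student_scores, temperature)
    return kl_divergence(teacher_dist, student_dist)


# Toy scores for three candidate passages (made-up numbers).
teacher = [2.0, 0.5, -1.0]
student = [1.0, 1.0, 0.0]

loss_mismatch = distillation_loss(teacher, student)
loss_match = distillation_loss(teacher, teacher)
```

In an iterative scheme, one would presumably alternate which encoder plays the teacher role between rounds, but that schedule is also an assumption here.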

Findings

01. DEDR outperforms baselines by 11.6% on OK-VQA and 30.9% on FVQA.

02. MM-FiD improves question answering accuracy by 5.5% on OK-VQA and 8.5% on FVQA.

03. The approach effectively bridges representation gaps between encoders, enhancing retrieval and answer generation.

Abstract

Knowledge-Intensive Visual Question Answering (KI-VQA) refers to answering a question about an image whose answer does not lie in the image. This paper presents a new pipeline for KI-VQA tasks, consisting of a retriever and a reader. First, we introduce DEDR, a symmetric dual encoding dense retrieval framework in which documents and queries are encoded into a shared embedding space using uni-modal (textual) and multi-modal encoders. We introduce an iterative knowledge distillation approach that bridges the gap between the representation spaces in these two encoders. Extensive evaluation on two well-established KI-VQA datasets, i.e., OK-VQA and FVQA, suggests that DEDR outperforms state-of-the-art baselines by 11.6% and 30.9% on OK-VQA and FVQA, respectively. Utilizing the passages retrieved by DEDR, we further introduce MM-FiD, an encoder-decoder multi-modal fusion-in-decoder model, for…
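The retrieval step described in the abstract can be illustrated with a small sketch of dense retrieval in a shared embedding space. It assumes the query and the passages have already been encoded into fixed-size vectors and that relevance is scored by a dot product; the scoring function, the `retrieve` helper, and the toy vectors are illustrative assumptions, not details confirmed by this page.

```python
from typing import List, Sequence, Tuple


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def retrieve(query_vec: Sequence[float],
             passage_vecs: Sequence[Sequence[float]],
             k: int = 2) -> List[Tuple[int, float]]:
    """Return the k highest-scoring (passage_index, score) pairs.

    Scoring by dot product is an assumption made for this sketch.
    """
    scored = [(i, dot(query_vec, p)) for i, p in enumerate(passage_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings standing in for encoder outputs (made-up numbers).
query = [1.0, 0.0, 1.0]
passages = [
    [0.5, 0.5, 0.0],
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
]

top = retrieve(query, passages, k=2)
```

In a pipeline like the one described, the top-ranked passages would then be handed to the reader model to generate the answer.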

Code & Models

Repositories

alirezasalemi7/dedr-mm-fid
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning

Methods: Knowledge Distillation