Pre-Training Multi-Modal Dense Retrievers for Outside-Knowledge Visual Question Answering
Alireza Salemi, Mahta Rafiee, Hamed Zamani

TL;DR
This paper introduces a pre-training method for multi-modal dense retrievers in outside-knowledge visual question answering, significantly improving retrieval accuracy and zero-shot performance.
Contribution
It proposes an automatic data generation pipeline for pre-training passage retrieval models, enhancing retrieval effectiveness in OK-VQA tasks.
Findings
26.9% improvement in Precision@5 over state-of-the-art
Effective zero-shot retrieval performance
Enhanced retrieval accuracy for multi-modal queries
Abstract
This paper studies a category of visual question answering tasks, in which accessing external knowledge is necessary for answering the questions. This category is called outside-knowledge visual question answering (OK-VQA). A major step in developing OK-VQA systems is to retrieve relevant documents for the given multi-modal query. Current state-of-the-art asymmetric dense retrieval model for this task uses an architecture with a multi-modal query encoder and a uni-modal document encoder. Such an architecture requires a large amount of training data for effective performance. We propose an automatic data generation pipeline for pre-training passage retrieval models for OK-VQA tasks. The proposed approach leads to 26.9% Precision@5 improvements compared to the current state-of-the-art asymmetric architecture. Additionally, the proposed pre-training approach exhibits a good ability in…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
