TL;DR
This paper studies passage retrieval for outside-knowledge visual question answering. It shows that dense retrieval with a multi-modal query encoder significantly outperforms sparse retrieval that uses object expansion, and matches sparse retrieval that uses human-generated captions.
Contribution
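It builds a dual-encoder dense retriever whose query encoder is LXMERT, a multi-modal pre-trained transformer, so that the question and image are encoded together as one query. This retriever significantly outperforms BM25 sparse retrieval that expands the question with object names.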
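As a rough illustration of how a dual-encoder retriever ranks passages, the sketch below scores precomputed query and passage vectors and returns the top k. It assumes dot-product similarity, which is a common choice for dual encoders but is not stated in this summary. The function names `dot` and `retrieve` are hypothetical, and the hard-coded vectors stand in for the outputs of a multi-modal query encoder (LXMERT in the paper) and a passage encoder.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors (assumed similarity function)."""
    return sum(x * y for x, y in zip(a, b))


def retrieve(query_vec: Vector, passage_vecs: List[Vector], k: int) -> List[Tuple[int, float]]:
    """Return (passage_index, score) for the k highest-scoring passages.

    query_vec stands in for the encoded question+image pair;
    passage_vecs stand in for independently encoded passages.
    Ties are broken by the lower passage index.
    """
    scored = [(dot(query_vec, p), i) for i, p in enumerate(passage_vecs)]
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [(i, s) for s, i in scored[:k]]


# Toy vectors only; real embeddings would come from trained encoders.
top = retrieve([1.0, 0.0], [[0.0, 1.0], [2.0, 0.0], [1.0, 1.0]], k=2)
```

Because passages are encoded independently of the query, their vectors can be computed once and indexed ahead of time; only the query needs encoding at search time.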
Findings
Dense retrieval significantly outperforms sparse retrieval that uses object expansion.
In sparse retrieval, captions tend to be more informative than object names.
Dense retrieval matches the performance of sparse retrieval that uses human-generated captions.
Abstract
In this work, we address multi-modal information needs that contain text questions and images by focusing on passage retrieval for outside-knowledge visual question answering. This task requires access to outside knowledge, which in our case we define to be a large unstructured passage collection. We first conduct sparse retrieval with BM25 and study expanding the question with object names and image captions. We verify that visual clues play an important role and captions tend to be more informative than object names in sparse retrieval. We then construct a dual-encoder dense retriever, with the query encoder being LXMERT, a multi-modal pre-trained transformer. We further show that dense retrieval significantly outperforms sparse retrieval that uses object expansion. Moreover, dense retrieval matches the performance of sparse retrieval that leverages human-generated captions.
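The sparse-retrieval setup described in the abstract, where the question is expanded with object names or a caption before BM25 scoring, can be sketched as follows. This is a minimal pure-Python BM25 with assumed defaults (`k1=1.2`, `b=0.75`, whitespace tokenization); the helper names, the toy passages, and the expansion term are all hypothetical and are not taken from the paper.

```python
import math
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase whitespace tokenizer (a simplification)."""
    return text.lower().split()


def expand_query(question: str, visual_terms: List[str]) -> List[str]:
    """Append tokens from object names or a caption to the question tokens."""
    return tokenize(question) + [t for term in visual_terms for t in tokenize(term)]


def bm25_scores(query_tokens: List[str], passages: List[str],
                k1: float = 1.2, b: float = 0.75) -> List[float]:
    """Score every passage against the query with Okapi BM25."""
    docs = [tokenize(p) for p in passages]
    if not docs:
        return []
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    # Document frequency: number of passages containing each term.
    df = Counter(t for d in docs for t in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query_tokens:
            if t not in tf:
                continue
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores


passages = [
    "the eiffel tower is in paris",
    "a dog catches a frisbee in the park",
    "frisbee was invented in the 1940s",
]
question = "when was this toy invented"
base = bm25_scores(tokenize(question), passages)
# "frisbee" plays the role of a detected object name or caption word.
expanded = bm25_scores(expand_query(question, ["frisbee"]), passages)
```

In this toy case the question alone never mentions the object in the image, so expansion is what lets passages about the frisbee receive or increase their score.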
Taxonomy
Methods: Learning Cross-Modality Encoder Representations from Transformers (LXMERT)
