Passage Retrieval for Outside-Knowledge Visual Question Answering

Chen Qu; Hamed Zamani; Liu Yang; W. Bruce Croft; Erik Learned-Miller

arXiv:2105.03938·cs.IR·May 11, 2021

TL;DR

This paper explores passage retrieval methods for outside-knowledge visual question answering, demonstrating that dense retrieval with multi-modal encoders outperforms sparse methods and matches human-caption-based retrieval.

Contribution

It introduces a dual-encoder dense retrieval approach using LXMERT for multi-modal queries, improving retrieval performance over traditional sparse methods.
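
The scoring step of a dual-encoder retriever can be sketched as below. This is only an illustration under stated assumptions: the vectors are hand-written stand-ins for encoder outputs (no LXMERT or passage encoder is run), inner-product scoring is assumed as the similarity, and `retrieve`, `query_vec`, and `passage_vecs` are hypothetical names rather than anything from the paper's code.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def retrieve(query_vec: Vector, passage_vecs: List[Vector], k: int = 2) -> List[Tuple[int, float]]:
    """Return the k (passage_index, score) pairs with the highest inner product."""
    scored = [(i, dot(query_vec, p)) for i, p in enumerate(passage_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical embeddings. In a real system the passage vectors would be
# precomputed offline by a passage encoder, and the query vector would come
# from a multi-modal query encoder applied to the question and the image.
passage_vecs = [
    [0.9, 0.1, 0.0],
    [0.1, 0.8, 0.2],
    [0.0, 0.1, 0.9],
]
query_vec = [0.2, 0.9, 0.1]

top = retrieve(query_vec, passage_vecs, k=2)
print([index for index, _ in top])
```

Because the two encoders run independently, passage vectors can be indexed once and reused for every query; only the query has to be encoded at search time.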

Findings

01 Dense retrieval significantly outperforms sparse retrieval that uses object expansion.

02 In sparse retrieval, captions tend to be more informative than object names.

03 Dense retrieval matches the performance of sparse retrieval that leverages human-generated captions.

Abstract

In this work, we address multi-modal information needs that contain text questions and images by focusing on passage retrieval for outside-knowledge visual question answering. This task requires access to outside knowledge, which in our case we define to be a large unstructured passage collection. We first conduct sparse retrieval with BM25 and study expanding the question with object names and image captions. We verify that visual clues play an important role and captions tend to be more informative than object names in sparse retrieval. We then construct a dual-encoder dense retriever, with the query encoder being LXMERT, a multi-modal pre-trained transformer. We further show that dense retrieval significantly outperforms sparse retrieval that uses object expansion. Moreover, dense retrieval matches the performance of sparse retrieval that leverages human-generated captions.
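
To make the sparse setup concrete, here is a minimal sketch of BM25 scoring with query expansion. It makes several assumptions that are not taken from the paper: whitespace tokenization, the commonly used defaults `k1=1.2` and `b=0.75`, a toy three-passage collection, and hypothetical visual terms standing in for detected object names or caption words. The helper names `bm25_scores` and `expand` are likewise invented for illustration.

```python
import math
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase whitespace tokenization (an assumption, not the paper's)."""
    return text.lower().split()


def bm25_scores(query: str, passages: List[str], k1: float = 1.2, b: float = 0.75) -> List[float]:
    """Score every passage against the query with Okapi BM25."""
    docs = [tokenize(p) for p in passages]
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    # Document frequency: number of passages containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def expand(question: str, visual_terms: List[str]) -> str:
    """Append visual clues (object names or caption words) to the question."""
    return question + " " + " ".join(visual_terms)


# Toy collection and question, all hypothetical.
passages = [
    "soccer is a sport played with a ball on a field",
    "tennis is a sport played with a racket on a court",
    "a racket is a piece of equipment",
]
question = "what sport is this"

plain = bm25_scores(question, passages)
expanded = bm25_scores(expand(question, ["racket", "court"]), passages)
print(plain)
print(expanded)
```

In this toy case the question alone cannot separate the two sport passages, while the expanded query favors the passage that mentions the objects in the image, which is the intuition behind adding visual clues to a text-only retriever.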

Code & Models

Repositories

prdwb/okvqa-release (PyTorch, official)

Taxonomy

Methods

Learning Cross-Modality Encoder Representations from Transformers (LXMERT)