Answer Mining from a Pool of Images: Towards Retrieval-Based Visual Question Answering
Abhirama Subramanyam Penamakuri, Manish Gupta, Mithun Das Gupta, Anand Mishra

TL;DR
This paper introduces a new retrieval-based visual question answering task, RETVQA, along with a large dataset and a unified model, MI-BART, that effectively retrieves relevant images and generates accurate, fluent answers.
Contribution
The paper presents the RETVQA task, a large dataset, and a novel MI-BART model that jointly performs image retrieval and answer generation for complex visual questions.
Findings
MI-BART achieves 76.5% accuracy on RETVQA
Outperforms state-of-the-art on WebQA dataset by 4.9% accuracy
Model generates fluent, contextually relevant answers
Abstract
We study visual question answering in a setting where the answer has to be mined from a pool of relevant and irrelevant images given as a context. For such a setting, a model must first retrieve relevant images from the pool and answer the question from these retrieved images. We refer to this problem as retrieval-based visual question answering (or RETVQA in short). The RETVQA is distinctively different and more challenging than the traditionally-studied Visual Question Answering (VQA), where a given question has to be answered with a single relevant image in context. Towards solving the RETVQA task, we propose a unified Multi Image BART (MI-BART) that takes a question and retrieved images using our relevance encoder for free-form fluent answer generation. Further, we introduce the largest dataset in this space, namely RETVQA, which has the following salient features: multi-image and…
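The retrieve-then-answer setting described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: `word_overlap` is a toy stand-in for the paper's relevance encoder, images are represented by hypothetical caption strings, and the `generate` callable is a placeholder for an answer generator such as MI-BART. All function names and the default `k` are assumptions made for illustration.

```python
from typing import Callable, List, Sequence


def word_overlap(question: str, image_caption: str) -> float:
    """Toy relevance score: number of words shared by question and caption.

    Assumption: a real system would score (question, image) pairs with a
    learned relevance encoder rather than caption word overlap.
    """
    return float(len(set(question.lower().split()) & set(image_caption.lower().split())))


def retrieve_top_k(
    question: str,
    pool: Sequence[str],
    relevance: Callable[[str, str], float],
    k: int = 2,
) -> List[str]:
    """Score every item in the pool and keep the k highest-scoring ones."""
    scored = [(relevance(question, item), item) for item in pool]
    # Python's sort is stable, so ties keep their original pool order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item for _, item in scored[:k]]


def answer_from_pool(
    question: str,
    pool: Sequence[str],
    relevance: Callable[[str, str], float],
    generate: Callable[[str, List[str]], str],
    k: int = 2,
) -> str:
    """Two stages: retrieve relevant items, then generate an answer from them."""
    retrieved = retrieve_top_k(question, pool, relevance, k)
    return generate(question, retrieved)


if __name__ == "__main__":
    pool = [
        "a red rose in a vase",
        "a yellow sunflower in a field",
        "a dog on a beach",
    ]
    question = "do the rose and sunflower have the same color"
    # Placeholder generator: simply reports which items were retrieved.
    echo = lambda q, items: " | ".join(items)
    print(answer_from_pool(question, pool, word_overlap, echo))
```

The point of the sketch is the separation of stages: the generator only ever sees the retrieved subset, so retrieval errors propagate directly into the answer.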
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
Methods: Attention Is All You Need · Linear Layer · Adam · Byte Pair Encoding · Multi-Head Attention · Residual Connection · Softmax · Dense Connections · Dropout
