Answer Mining from a Pool of Images: Towards Retrieval-Based Visual Question Answering

Abhirama Subramanyam Penamakuri; Manish Gupta; Mithun Das Gupta; Anand Mishra

arXiv:2306.16713·cs.CV·June 30, 2023

TL;DR

This paper introduces retrieval-based visual question answering (RETVQA), a task in which the answer must be mined from a pool of relevant and irrelevant images, together with a large dataset and a unified model, MI-BART, which generates fluent free-form answers from the question and the images retrieved by a relevance encoder.

Contribution

The paper presents the RETVQA task, a large dataset of the same name, and MI-BART, a unified model that takes a question and the images retrieved by a relevance encoder and generates free-form answers.

Findings

01 MI-BART achieves 76.5% accuracy on the RETVQA dataset.

02 MI-BART outperforms the state of the art on the WebQA dataset by 4.9% in accuracy.

03 The model generates fluent, contextually relevant answers.

Abstract

We study visual question answering in a setting where the answer has to be mined from a pool of relevant and irrelevant images given as context. In such a setting, a model must first retrieve the relevant images from the pool and then answer the question from these retrieved images. We refer to this problem as retrieval-based visual question answering (RETVQA for short). RETVQA is distinctly different from, and more challenging than, traditionally studied Visual Question Answering (VQA), where a given question has to be answered with a single relevant image in context. Towards solving the RETVQA task, we propose a unified Multi Image BART (MI-BART) that takes a question and the images retrieved by our relevance encoder and generates free-form, fluent answers. Further, we introduce the largest dataset in this space, namely RETVQA, which has the following salient features: multi-image and…

Code & Models

Repositories

Abhiram4572/mi_bart
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning

Methods: Attention Is All You Need · Linear Layer · Adam · Byte Pair Encoding · Multi-Head Attention · Residual Connection · Softmax · Dense Connections · Dropout