Multi-Image Visual Question Answering

Harsh Raj; Janhavi Dadhania; Akhilesh Bhardwaj; Prabuchandran KJ

arXiv:2112.13706 · cs.CV · February 8, 2022 · 1 citation

TL;DR

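This paper presents an empirical study of multi-image visual question answering: it compares feature extraction methods and loss functions, proposes a new dataset in which each question comes with multiple input images but only one ground truth, and benchmarks a model that reaches 39% word accuracy and 99% image accuracy on CLEVER+TinyImagenet.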

Contribution

It introduces a new multi-image VQA dataset, evaluates several feature extraction methods and loss functions, and presents a model inspired by the stacked attention network that combines ResNet + RCNN image features with BERT embeddings.

Findings

01. 39% word accuracy on CLEVER+TinyImagenet

02. 99% image accuracy on CLEVER+TinyImagenet

03. ResNet + RCNN features with BERT embeddings outperform previous methods
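
The two headline numbers measure different things, and this page does not define either one. As a rough illustration only, the sketch below assumes that word accuracy is the fraction of questions whose predicted answer word matches the ground-truth word, and that image accuracy is the fraction of questions for which the model selects the correct image among the inputs. The function name `multi_image_vqa_accuracy` and its arguments are hypothetical and are not taken from the paper or its repository.

```python
from typing import Sequence, Tuple


def multi_image_vqa_accuracy(
    pred_words: Sequence[str],
    true_words: Sequence[str],
    pred_images: Sequence[int],
    true_images: Sequence[int],
) -> Tuple[float, float]:
    """Return (word_accuracy, image_accuracy) under the assumed definitions.

    Assumption: one answer word and one selected image index per question.
    """
    n = len(true_words)
    if n == 0 or not (len(pred_words) == len(pred_images) == len(true_images) == n):
        raise ValueError("all inputs must be non-empty and of equal length")

    # Word accuracy: case-insensitive exact match on the answer word.
    word_hits = sum(
        p.strip().lower() == t.strip().lower()
        for p, t in zip(pred_words, true_words)
    )
    # Image accuracy: the selected image index equals the ground-truth index.
    image_hits = sum(p == t for p, t in zip(pred_images, true_images))
    return word_hits / n, image_hits / n


# Toy example: every image is picked correctly, but only half the answers match.
word_acc, image_acc = multi_image_vqa_accuracy(
    pred_words=["red", "cube", "2", "yes"],
    true_words=["red", "sphere", "3", "Yes"],
    pred_images=[0, 2, 1, 3],
    true_images=[0, 2, 1, 3],
)
```

Under these assumed definitions, a model can score high on image accuracy while scoring much lower on word accuracy, which matches the pattern of the reported 99% and 39%.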

Abstract

While a lot of work has been done on developing models to tackle the problem of Visual Question Answering, the ability of these models to relate the question to the image features remains less explored. We present an empirical study of different feature extraction methods with different loss functions. We propose a new dataset for the task of Visual Question Answering with multiple image inputs having only one ground truth, and benchmark our results on it. Our final model, which uses ResNet + RCNN image features and BERT embeddings and is inspired by the stacked attention network, gives 39% word accuracy and 99% image accuracy on the CLEVER+TinyImagenet dataset.
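
The abstract says the final model is inspired by the stacked attention network, in which a question vector repeatedly attends over image features and is refined after each hop. The sketch below is a simplified illustration of that idea for several candidate images; it is not the authors' implementation. It assumes one pooled feature vector per image, replaces the learned projection layers with a plain dot product, and assumes the final attention weights are used to pick the image. All names (`attention_hop`, `stacked_attention`) are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_hop(
    image_feats: Sequence[Vector], query: Vector
) -> Tuple[List[float], Vector]:
    """One simplified attention hop over per-image feature vectors."""
    # Assumption: dot-product scoring stands in for the learned layers.
    scores = [sum(f * q for f, q in zip(feat, query)) for feat in image_feats]
    weights = softmax(scores)
    # Attention-weighted sum of the image vectors.
    attended = [
        sum(w * feat[d] for w, feat in zip(weights, image_feats))
        for d in range(len(query))
    ]
    # Refine the query by adding the attended vector, as in stacked attention.
    refined = [a + q for a, q in zip(attended, query)]
    return weights, refined


def stacked_attention(
    image_feats: Sequence[Vector], question_vec: Vector, hops: int = 2
) -> Tuple[int, List[float], Vector]:
    """Run several hops and return (picked image index, weights, final query)."""
    query = list(question_vec)
    weights: List[float] = []
    for _ in range(hops):
        weights, query = attention_hop(image_feats, query)
    picked = max(range(len(weights)), key=weights.__getitem__)
    return picked, weights, query


# Toy input: three images with 2-d features and a question aligned with image 1.
picked, weights, final_query = stacked_attention(
    image_feats=[[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],
    question_vec=[0.0, 2.0],
)
```

In a real model the image vectors would come from the ResNet and RCNN feature extractors and the question vector from BERT, with learned layers in place of the dot product; the refined query would then feed an answer classifier.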

Code & Models

Repositories

harshraj22/vqa (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Average Pooling · Global Average Pooling · Batch Normalization · Max Pooling · 1x1 Convolution