Multi-Image Visual Question Answering
Harsh Raj, Janhavi Dadhania, Akhilesh Bhardwaj, Prabuchandran KJ

TL;DR
This paper conducts an empirical study on multi-image visual question answering, exploring feature extraction methods, proposing a new dataset, and benchmarking a model that achieves high accuracy on this task.
Contribution
It introduces a new multi-image VQA dataset, evaluates various feature extraction methods, and presents a model with notable accuracy improvements.
Findings
39% word accuracy on CLEVER+TinyImagenet
99% image accuracy on CLEVER+TinyImagenet
ResNet + RCNN features with BERT embeddings outperform previous methods
Abstract
While a lot of work has been done on developing models to tackle the problem of Visual Question Answering, the ability of these models to relate the question to the image features still remains less explored. We present an empirical study of different feature extraction methods with different loss functions. We propose a new dataset for the task of Visual Question Answering with multiple image inputs having only one ground truth, and benchmark our results on it. Our final model, utilising ResNet + RCNN image features and BERT embeddings and inspired by the stacked attention network, gives 39% word accuracy and 99% image accuracy on the CLEVER+TinyImagenet dataset.
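The abstract names the stacked attention network as the inspiration for the final model but does not describe the architecture. As a rough illustration only, the sketch below implements generic stacked attention: a question vector repeatedly attends over a set of region feature vectors and is refined after each hop. It is not the authors' model. The function names, the parameter shapes, the omission of bias terms, and the random initialisation are all assumptions made for this example, and in practice the region vectors would come from an image encoder and the query from a text encoder rather than from a random generator.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_hop(regions, query, W_i, W_q, w_p):
    """One attention hop: score every region against the query, then refine the query.

    regions:  list of d-dimensional region feature vectors
    query:    d-dimensional question vector
    W_i, W_q: k x d projection matrices (hypothetical parameters, biases omitted)
    w_p:      k-dimensional scoring vector (hypothetical parameter)
    """
    q_proj = matvec(W_q, query)
    scores = []
    for v in regions:
        v_proj = matvec(W_i, v)
        hidden = [math.tanh(a + b) for a, b in zip(v_proj, q_proj)]
        scores.append(sum(w * h for w, h in zip(w_p, hidden)))
    weights = softmax(scores)
    # Attention-weighted sum of the region vectors.
    attended = [
        sum(p * v[j] for p, v in zip(weights, regions))
        for j in range(len(query))
    ]
    # The refined query is the attended visual vector plus the previous query.
    refined = [a + q for a, q in zip(attended, query)]
    return refined, weights


def stacked_attention(regions, query, layers):
    """Apply several attention hops in sequence, collecting the weights of each hop."""
    all_weights = []
    for W_i, W_q, w_p in layers:
        query, weights = attention_hop(regions, query, W_i, W_q, w_p)
        all_weights.append(weights)
    return query, all_weights


def random_layer(rng, k, d):
    """Build one hop's parameters with small random values (illustration only)."""
    W_i = [[rng.uniform(-0.1, 0.1) for _ in range(d)] for _ in range(k)]
    W_q = [[rng.uniform(-0.1, 0.1) for _ in range(d)] for _ in range(k)]
    w_p = [rng.uniform(-0.1, 0.1) for _ in range(k)]
    return W_i, W_q, w_p
```

Each hop returns a probability distribution over the regions, so the per-hop weights can be inspected to see which regions the refined query favours. How such weights relate to the word accuracy and image accuracy figures reported above is not stated in the abstract.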
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Average Pooling · Global Average Pooling · Batch Normalization · Max Pooling · 1x1 Convolution
