Analysis on Image Set Visual Question Answering

Abhinav Khattar; Aviral Joshi; Har Simrat Singh; Pulkit Goel; Rohit Prakash Barnwal

arXiv:2104.00107·cs.CV·April 2, 2021

Open Access

TL;DR

This paper explores multi-image visual question answering on the ISVQA dataset, proposing four approaches to enhance model performance by improving spatial awareness, reducing language bias, and refining counting capabilities.

Contribution

It introduces four novel methods to improve multi-image VQA performance, focusing on spatial, language, and counting aspects, with detailed analysis of language bias in the dataset.

Findings

01 Slight performance improvements over baseline models

02 Enhanced spatial awareness and color identification

03 Reduced language bias through adversarial regularization

Abstract

We tackle the challenge of Visual Question Answering in a multi-image setting for the ISVQA dataset. Traditional VQA tasks have focused on a single-image setting where the target answer is generated from a single image. Image set VQA, however, comprises a set of images and requires finding connections between the images, relating objects across images based on these connections, and generating a unified answer. In this report, we work with four approaches in a bid to improve performance on the task. We analyse and compare our results with three baseline models (LXMERT, HME-VideoQA and VisualBERT) and show that our approaches can provide a slight improvement over the baselines. Specifically, we try to improve the spatial awareness of the model and help the model identify color using enhanced pre-training, reduce language dependence using adversarial regularization, and improve counting…

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques

Methods: Learning Cross-Modality Encoder Representations from Transformers · VisualBERT