Generative Visual Question Answering

Ethan Shen; Scotty Singh; Bhavesh Kumar

arXiv:2307.10405·cs.CV·July 21, 2023

Open Access

TL;DR

This paper introduces GenVQA, a new dataset for evaluating the temporal generalization of Visual Question Answering models, and analyzes model robustness to future data distribution shifts using augmented images.

Contribution

The paper presents GenVQA, a novel dataset generated with Stable Diffusion, for testing and improving the temporal robustness of VQA models beyond their training data.

Findings

01. Models show varying robustness to temporal shifts

02. Certain architectural choices improve generalization

03. Augmented datasets enhance model adaptability

Abstract

Multi-modal deep learning tasks involving vision and language continue to rise in popularity and are driving the development of newer models that can generalize beyond their training data. Current models lack temporal generalization, the ability to adapt to changes in future data. This paper discusses a viable approach to creating an advanced Visual Question Answering (VQA) model that can produce successful results on temporal generalization. We propose a new dataset, GenVQA, which uses images and captions from the VQAv2 and MS-COCO datasets to generate new images through Stable Diffusion. This augmented dataset is then used to test a combination of seven baseline and cutting-edge VQA models. Performance evaluation focuses on questions mirroring the original VQAv2 dataset, with the answers adjusted to the new images. This paper's purpose is to…
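
The dataset construction described in the abstract (caption in, generated image out, original question carried over) can be sketched roughly as follows. This is a minimal illustration and not the authors' pipeline: the record types, the `build_augmented_records` function, the `needs_reannotation` flag, and the `generate_image` callable are all hypothetical names introduced here. The abstract does not say how answers were adjusted, so the sketch only flags each record for re-annotation.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class VQARecord:
    """One question/answer pair tied to a source image (hypothetical schema)."""
    image_id: int
    question: str
    answer: str


@dataclass
class AugmentedRecord:
    """A question paired with an image generated from the source caption."""
    image_id: int
    caption: str
    question: str
    original_answer: str
    generated_image: Any
    # Assumption: the original answer may not hold for the new image,
    # so every record is flagged until its answer has been re-checked.
    needs_reannotation: bool = True


def build_augmented_records(
    captions: Dict[int, str],
    qa_pairs: List[VQARecord],
    generate_image: Callable[[str], Any],
) -> List[AugmentedRecord]:
    """Pair each question with an image generated from its image's caption.

    `generate_image` is any text-to-image function supplied by the caller;
    one image is generated per source image and reused across its questions.
    """
    image_cache: Dict[int, Any] = {}
    records: List[AugmentedRecord] = []
    for qa in qa_pairs:
        caption = captions.get(qa.image_id)
        if caption is None:
            continue  # no caption available, so no image can be generated
        if qa.image_id not in image_cache:
            image_cache[qa.image_id] = generate_image(caption)
        records.append(
            AugmentedRecord(
                image_id=qa.image_id,
                caption=caption,
                question=qa.question,
                original_answer=qa.answer,
                generated_image=image_cache[qa.image_id],
            )
        )
    return records
```

In practice `generate_image` would wrap a text-to-image model; passing a stub that returns a placeholder is enough to exercise the pairing logic without loading any model.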

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning