Generative Visual Question Answering
Ethan Shen, Scotty Singh, Bhavesh Kumar

TL;DR
This paper introduces GenVQA, a new dataset for evaluating the temporal generalization of Visual Question Answering models, and analyzes model robustness to future data distribution shifts using augmented images.
Contribution
The paper presents GenVQA, a novel dataset generated with Stable Diffusion, to test and improve the temporal robustness of VQA models beyond their training data.
Findings
Models show varying robustness to temporal shifts
Certain architectural choices improve generalization
Augmented datasets enhance model adaptability
Abstract
Multi-modal deep learning tasks involving vision and language continue to rise in popularity and are driving the development of newer models that can generalize beyond their training data. Current models lack temporal generalization, the ability to adapt to changes in future data. This paper discusses a viable approach to creating an advanced Visual Question Answering (VQA) model that performs well under temporal generalization. We propose a new dataset, GenVQA, which uses images and captions from the VQAv2 and MS-COCO datasets to generate new images through Stable Diffusion. This augmented dataset is then used to test a combination of seven baseline and cutting-edge VQA models. Performance evaluation focuses on questions mirroring the original VQAv2 dataset, with the answers adjusted to match the new images. This paper's purpose is to…
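The evaluation loop described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: `VQAExample`, `build_augmented_set`, `accuracy`, and `robustness_gap` are hypothetical names, the image generator is passed in as a plain callable standing in for a Stable Diffusion call, and exact-match scoring is a simplifying assumption rather than the metric used in the paper.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, List, Tuple


@dataclass
class VQAExample:
    """One (image, question, answer) triple; `image` can be any handle."""
    image: Any
    question: str
    answer: str


def build_augmented_set(
    captioned: Iterable[Tuple[str, str, str]],
    generate_image: Callable[[str], Any],
) -> List[VQAExample]:
    """Generate a new image from each caption and pair it with its question.

    `captioned` yields (caption, question, adjusted_answer) tuples.
    `generate_image` is a hypothetical stand-in for a text-to-image model.
    """
    return [VQAExample(generate_image(c), q, a) for c, q, a in captioned]


def accuracy(model: Callable[[Any, str], str], examples: Iterable[VQAExample]) -> float:
    """Exact-match accuracy (an assumed, simplified metric)."""
    examples = list(examples)
    if not examples:
        return 0.0
    correct = sum(
        model(ex.image, ex.question).strip().lower() == ex.answer.strip().lower()
        for ex in examples
    )
    return correct / len(examples)


def robustness_gap(model, original, augmented) -> float:
    """Accuracy drop from the original set to the generated set."""
    return accuracy(model, original) - accuracy(model, augmented)
```

A larger gap for a given model would suggest that it is less robust to the shift introduced by the generated images; comparing gaps across models is one simple way to rank them under this assumption.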
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
