Domain-robust VQA with diverse datasets and methods but no target labels
Mingda Zhang, Tristan Maidment, Ahmad Diab, Adriana Kovashka, Rebecca Hwa

TL;DR
This paper investigates the robustness of various VQA models to domain shifts across datasets, quantifies these shifts, and proposes a new domain adaptation method tailored for VQA's unique challenges.
Contribution
It introduces a comprehensive analysis of domain shifts in VQA, evaluates existing models' robustness, and develops a novel domain adaptation approach specific to VQA complexities.
Findings
VQA datasets exhibit significant domain shifts in visual and textual modalities.
Transformer-based VQA models show greater robustness to domain shifts.
The proposed domain adaptation method improves VQA performance across different datasets.
Abstract
The observation that computer vision methods overfit to dataset specifics has inspired diverse attempts to make object recognition models robust to domain shifts. However, similar work on domain-robust visual question answering methods is very limited. Domain adaptation for VQA differs from adaptation for object recognition due to additional complexity: VQA models handle multimodal inputs, methods contain multiple steps with diverse modules resulting in complex optimization, and answer spaces in different datasets are vastly different. To tackle these challenges, we first quantify domain shifts between popular VQA datasets, in both visual and textual space. To disentangle shifts between datasets arising from different modalities, we also construct synthetic shifts in the image and question domains separately. Second, we test the robustness of different families of VQA methods (classic…
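As an illustration of what quantifying a domain shift between two datasets can look like, the sketch below computes a maximum mean discrepancy (MMD) between two sets of feature vectors. This is an assumption made for illustration only: the abstract as shown here does not say which shift measure the authors use, and the function names, the RBF kernel, the bandwidth heuristic, and the random stand-in features are all hypothetical. In practice the rows would be image or question embeddings extracted from two VQA datasets.

```python
import numpy as np

def rbf_kernel(a, b, gamma):
    # Pairwise squared Euclidean distances between rows of a and rows of b.
    sq = (a ** 2).sum(1)[:, None] + (b ** 2).sum(1)[None, :] - 2.0 * a @ b.T
    # Clamp tiny negative values caused by floating-point rounding.
    return np.exp(-gamma * np.maximum(sq, 0.0))

def mmd2(source, target, gamma=None):
    """Biased estimate of squared MMD between two feature sets (rows = samples)."""
    if gamma is None:
        # Hypothetical bandwidth heuristic: one over the feature dimension.
        gamma = 1.0 / source.shape[1]
    k_ss = rbf_kernel(source, source, gamma).mean()
    k_tt = rbf_kernel(target, target, gamma).mean()
    k_st = rbf_kernel(source, target, gamma).mean()
    return k_ss + k_tt - 2.0 * k_st

# Random stand-ins for embeddings; not real VQA features.
rng = np.random.default_rng(0)
feats_a = rng.normal(0.0, 1.0, size=(200, 16))  # "dataset A"
feats_b = rng.normal(0.0, 1.0, size=(200, 16))  # same distribution as A
feats_c = rng.normal(1.0, 1.0, size=(200, 16))  # mean-shifted distribution

same = mmd2(feats_a, feats_b)
shifted = mmd2(feats_a, feats_c)
print(f"MMD^2 same distribution:    {same:.4f}")
print(f"MMD^2 shifted distribution: {shifted:.4f}")
```

Running the same measure separately on visual features and on question features would give one score per modality, which matches the abstract's framing of shifts in both visual and textual space.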
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Human Pose and Action Recognition
