Multi-Sourced Compositional Generalization in Visual Question Answering
Chuanhao Li, Wenbo Ye, Zhen Li, Yuwei Wu, Yunde Jia

TL;DR
This paper investigates multi-sourced compositional generalization in visual question answering, proposing a retrieval-augmented training framework to improve models' ability to generalize to novel multi-modal compositions.
Contribution
The paper introduces a retrieval-augmented training framework and a new dataset, GQA-MSCG, for evaluating multi-sourced compositional generalization in VQA.
Findings
The proposed framework improves generalization to multi-sourced novel compositions.
The GQA-MSCG dataset effectively evaluates MSCG in VQA.
Experimental results show enhanced performance over baseline models.
Abstract
Compositional generalization is the ability to generalize to novel compositions of seen primitives, and has recently received much attention in vision-and-language (V&L). Due to the multi-modal nature of V&L tasks, the primitives that make up compositions come from different modalities, resulting in multi-sourced novel compositions. However, the ability to generalize over multi-sourced novel compositions, i.e., multi-sourced compositional generalization (MSCG), remains unexplored. In this paper, we explore MSCG in the context of visual question answering (VQA), and propose a retrieval-augmented training framework to enhance the MSCG ability of VQA models by learning unified representations for primitives from different modalities. Specifically, semantically equivalent primitives are retrieved for each primitive in the training samples, and the retrieved features are…
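The retrieval step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes primitives from both modalities are already embedded in a shared vector space and that "semantically equivalent" is approximated by cosine similarity. The feature bank, its keys, and the function names are hypothetical. How the retrieved features are then used during training is not covered here, since the abstract is truncated at that point.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def retrieve_top_k(query, bank, k=2):
    """Return the names of the k bank entries most similar to the query feature."""
    scored = [(cosine(query, feat), name) for name, feat in bank.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored[:k]]

# Hypothetical toy bank mixing linguistic ("word:") and visual ("region:") primitives.
bank = {
    "word:red": [0.9, 0.1, 0.0],
    "region:red_patch": [0.8, 0.2, 0.1],
    "word:dog": [0.0, 0.9, 0.3],
    "region:dog_crop": [0.1, 0.8, 0.4],
}

# Assumed feature of a "red" primitive appearing in a training sample.
query = [1.0, 0.0, 0.0]
neighbours = retrieve_top_k(query, bank, k=2)
print(neighbours)
```

In this toy setup the query retrieves one linguistic and one visual entry for the same concept, which is the kind of cross-modal pairing a unified primitive representation would rely on.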
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
Methods: Softmax · Attention Is All You Need
