Multi-Sourced Compositional Generalization in Visual Question Answering

Chuanhao Li; Wenbo Ye; Zhen Li; Yuwei Wu; Yunde Jia

arXiv:2505.23045 · cs.CV · May 30, 2025

TL;DR

This paper investigates multi-sourced compositional generalization in visual question answering, proposing a retrieval-augmented training framework to improve models' ability to generalize to novel multi-modal compositions.

Contribution

It introduces a novel retrieval-augmented training method and a new dataset to evaluate multi-sourced compositional generalization in VQA.

Findings

01. The proposed framework improves generalization to multi-sourced novel compositions.

02. The GQA-MSCG dataset effectively evaluates MSCG in VQA.

03. Experimental results show enhanced performance over baseline models.

Abstract

Compositional generalization is the ability to generalize to novel compositions of seen primitives, and it has recently received much attention in vision-and-language (V&L) research. Because V&L tasks are multi-modal, the primitives that make up a composition come from different modalities, which results in multi-sourced novel compositions. However, the ability to generalize over multi-sourced novel compositions, i.e., multi-sourced compositional generalization (MSCG), remains unexplored. In this paper, we explore MSCG in the context of visual question answering (VQA) and propose a retrieval-augmented training framework that enhances the MSCG ability of VQA models by learning unified representations for primitives from different modalities. Specifically, semantically equivalent primitives are retrieved for each primitive in the training samples, and the retrieved features are…
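
The retrieval step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract is truncated before it says how the retrieved features are used, so only retrieval is shown. The feature bank, its keys, the vectors, the cosine-similarity criterion, and the function names `cosine` and `retrieve_equivalents` are all assumptions made for illustration.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve_equivalents(
    query: Sequence[float],
    bank: Dict[str, Sequence[float]],
    k: int = 2,
) -> List[Tuple[str, float]]:
    """Return the k bank entries most similar to the query, most similar first."""
    scored = [(name, cosine(query, feat)) for name, feat in bank.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Hypothetical toy bank: keys mark modality and primitive, values are made-up features.
feature_bank = {
    "text:red": [0.9, 0.1, 0.0],
    "image:red_patch": [0.8, 0.2, 0.1],
    "text:dog": [0.0, 0.1, 0.9],
    "image:dog_region": [0.1, 0.0, 0.95],
}

# Assumed query feature standing in for the primitive "red".
query_feature = [1.0, 0.0, 0.0]
neighbors = retrieve_equivalents(query_feature, feature_bank, k=2)
print([name for name, _ in neighbors])
```

On this toy bank the two nearest entries are the text and image features for "red", which is the kind of cross-modal pairing the abstract describes as semantically equivalent primitives. A real system would use learned encoder features and likely an approximate nearest-neighbor index instead of a linear scan.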

Code & Models

Repositories

nevermorelch/mscg (Official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning

Methods: Softmax · Attention Is All You Need