Cross-Modal Contrastive Learning for Robust Reasoning in VQA
Qi Zheng, Chaoyue Wang, Daqing Liu, Dadong Wang, Dacheng Tao

TL;DR
This paper introduces a cross-modal contrastive learning approach that enhances robustness in visual question answering by reducing shortcut reasoning and leveraging fine-grained language-image correspondences.
Contribution
It proposes a novel contrastive learning strategy that avoids complex negative sampling and uses graph-based image relationships to improve VQA reasoning robustness.
Findings
Outperforms state-of-the-art on multiple VQA datasets.
Reduces reliance on shortcut reasoning in VQA models.
Demonstrates effectiveness of graph-based negative sampling.
Abstract
Multi-modal reasoning in visual question answering (VQA) has witnessed rapid progress recently. However, most reasoning models heavily rely on shortcuts learned from training data, which prevents their usage in challenging real-world scenarios. In this paper, we propose a simple but effective cross-modal contrastive learning strategy to get rid of the shortcut reasoning caused by imbalanced annotations and improve the overall performance. Different from existing contrastive learning with complex negative categories on coarse (Image, Question, Answer) triplet level, we leverage the correspondences between the language and image modalities to perform finer-grained cross-modal contrastive learning. We treat each Question-Answer (QA) pair as a whole, and differentiate between images that conform with it and those against it. To alleviate the issue of sampling bias, we further build…
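The idea of treating each QA pair as an anchor and contrasting conforming images against non-conforming ones can be sketched as an InfoNCE-style loss. This is only an illustration: the excerpt does not state the paper's exact loss, so the InfoNCE form, cosine similarity, the temperature value, and the names `qa_image_contrastive_loss`, `qa_emb`, `pos_img_emb`, and `neg_img_embs` are all assumptions, and the embeddings are taken as already computed by some encoder.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def qa_image_contrastive_loss(qa_emb, pos_img_emb, neg_img_embs, temperature=0.1):
    """InfoNCE-style loss (an assumed form, not confirmed by the source).

    qa_emb       -- embedding of one Question-Answer pair, used as the anchor
    pos_img_emb  -- embedding of an image that conforms with the QA pair
    neg_img_embs -- embeddings of images that contradict the QA pair
    """
    # Index 0 holds the positive image; the rest are negatives.
    logits = [cosine(qa_emb, pos_img_emb) / temperature]
    logits += [cosine(qa_emb, neg) / temperature for neg in neg_img_embs]

    # Numerically stable log-sum-exp over all candidate images.
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))

    # Negative log-probability of picking the positive image.
    return log_sum_exp - logits[0]


# Toy 2-D embeddings, purely hypothetical.
qa = [1.0, 0.0]
conforming = [0.9, 0.1]
contradicting = [[0.0, 1.0], [-1.0, 0.2]]

aligned_loss = qa_image_contrastive_loss(qa, conforming, contradicting)
swapped_loss = qa_image_contrastive_loss(qa, contradicting[0], [conforming, contradicting[1]])
```

In this toy setup the loss is small when the conforming image is the one closest to the QA anchor and large when a contradicting image is labeled as the positive, which is the behavior a contrastive objective of this kind is meant to reward.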
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
Methods: Contrastive Learning
