Cross-Modal Contrastive Learning for Robust Reasoning in VQA

Qi Zheng; Chaoyue Wang; Daqing Liu; Dadong Wang; Dacheng Tao

arXiv:2211.11190 · cs.CV · November 22, 2022

TL;DR

This paper introduces a cross-modal contrastive learning approach that enhances robustness in visual question answering by reducing shortcut reasoning and leveraging fine-grained language-image correspondences.

Contribution

It proposes a novel contrastive learning strategy that avoids complex negative sampling and uses graph-based image relationships to improve VQA reasoning robustness.

Findings

01. Outperforms state-of-the-art on multiple VQA datasets.

02. Reduces reliance on shortcut reasoning in VQA models.

03. Demonstrates effectiveness of graph-based negative sampling.

Abstract

Multi-modal reasoning in visual question answering (VQA) has witnessed rapid progress recently. However, most reasoning models heavily rely on shortcuts learned from training data, which prevents their usage in challenging real-world scenarios. In this paper, we propose a simple but effective cross-modal contrastive learning strategy to get rid of the shortcut reasoning caused by imbalanced annotations and improve the overall performance. Different from existing contrastive learning with complex negative categories on coarse (Image, Question, Answer) triplet level, we leverage the correspondences between the language and image modalities to perform finer-grained cross-modal contrastive learning. We treat each Question-Answer (QA) pair as a whole, and differentiate between images that conform with it and those against it. To alleviate the issue of sampling bias, we further build…
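
To make the idea of contrasting images against a QA pair concrete, here is a minimal sketch of a generic InfoNCE-style loss. It is an illustration under stated assumptions, not the authors' formulation: the function names `cosine` and `qa_image_contrastive_loss`, the use of cosine similarity, and the `temperature` value are all hypothetical choices, and the embeddings are plain Python lists rather than outputs of a real encoder. How the negative images are selected is outside the scope of this sketch.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def qa_image_contrastive_loss(qa_vec, pos_image_vec, neg_image_vecs, temperature=0.1):
    """Generic InfoNCE-style loss (an assumption, not the paper's exact loss).

    qa_vec:         embedding of one Question-Answer pair treated as a whole
    pos_image_vec:  embedding of an image that conforms with the QA pair
    neg_image_vecs: embeddings of images that contradict the QA pair
    """
    # Similarity logits: the positive image first, then every negative image.
    logits = [cosine(qa_vec, pos_image_vec) / temperature]
    logits += [cosine(qa_vec, neg) / temperature for neg in neg_image_vecs]

    # Numerically stable log-sum-exp over all candidate images.
    max_logit = max(logits)
    log_denominator = max_logit + math.log(
        sum(math.exp(logit - max_logit) for logit in logits)
    )

    # Negative log-probability that the positive image is chosen.
    return log_denominator - logits[0]


if __name__ == "__main__":
    qa = [1.0, 0.0]
    aligned = qa_image_contrastive_loss(qa, [1.0, 0.0], [[0.0, 1.0]])
    misaligned = qa_image_contrastive_loss(qa, [0.0, 1.0], [[1.0, 0.0]])
    print(aligned, misaligned)
```

In this toy setup the loss is small when the conforming image lies closer to the QA embedding than the contradicting one, and large when the roles are swapped, which is the behaviour a contrastive objective of this kind is meant to reward.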

Code & Models

Repositories

qizhust/cmcl_vqa_pl (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning

Methods: Contrastive Learning