More Than Just Attention: Improving Cross-Modal Attentions with Contrastive Constraints for Image-Text Matching

Yuxiao Chen; Jianbo Yuan; Long Zhao; Tianlang Chen; Rui Luo; Larry Davis; Dimitris N. Metaxas

arXiv:2105.09597 · cs.CV · October 5, 2022 · 1 citation

TL;DR

This paper introduces contrastive training strategies to enhance cross-modal attention in image-text matching, leading to improved retrieval accuracy and attention quality without needing explicit attention annotations.

Contribution

The authors propose two novel contrastive constraints, CCR and CCS, that can be integrated into existing models to improve attention supervision and matching performance.

Findings

01. Enhanced retrieval performance on Flickr30k and MS-COCO datasets.

02. Improved attention quality measured by new attention metrics.

03. Compatibility with multiple state-of-the-art models.

Abstract

Cross-modal attention mechanisms have been widely applied to the image-text matching task and have achieved remarkable improvements thanks to their ability to learn fine-grained relevance across modalities. However, the cross-modal attention models in existing methods can be sub-optimal and inaccurate because no direct supervision is provided during training. In this work, we propose two novel training strategies, the Contrastive Content Re-sourcing (CCR) and Contrastive Content Swapping (CCS) constraints, to address these limitations. These constraints supervise the training of cross-modal attention models in a contrastive learning manner without requiring explicit attention annotations. They are plug-in training strategies and can be easily integrated into existing cross-modal attention models. Additionally, we introduce three metrics including…
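
To make the idea of supervising attention "in a contrastive learning manner" more concrete, here is a minimal sketch of one possible margin-based constraint. It is not the paper's CCR or CCS formulation, which the abstract does not spell out. The function name `attention_contrast_loss`, the use of cosine similarity, the inverted-attention negative, and the margin value are all assumptions made for illustration. The sketch asks that the content a query attends to match the query better than the content it ignores, which needs no attention annotations.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def attention_contrast_loss(query, keys, attn, margin=0.2):
    """Hypothetical hinge loss on attention weights (illustrative only).

    query : (d,) feature of one fragment, e.g. a word
    keys  : (n, d) features from the other modality, e.g. image regions
    attn  : (n,) non-negative attention weights that sum to 1
    """
    # Positive: content selected by the attention weights.
    attended = attn @ keys
    # Assumed negative: content selected by the inverted, renormalized weights.
    inverted = 1.0 - attn
    inverted = inverted / inverted.sum()
    ignored = inverted @ keys
    # Penalize unless attended content beats ignored content by the margin.
    return max(0.0, margin - cosine(query, attended) + cosine(query, ignored))

query = np.array([1.0, 0.0])
keys = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])

good = attention_contrast_loss(query, keys, np.array([1.0, 0.0, 0.0]))
bad = attention_contrast_loss(query, keys, np.array([0.0, 0.0, 1.0]))
print(good, bad)
```

In this toy setup, attention focused on the matching key incurs no penalty, while attention focused on the opposing key is penalized. In a real model the loss would be computed on differentiable tensors and added to the matching objective.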

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning

Methods: Contrastive Learning