Contrastive Cross-Modal Knowledge Sharing Pre-training for Vision-Language Representation Learning and Retrieval

Keyu Wen, Zhenshan Tan, Qingrong Cheng, Cheng Chen, and Xiaodong Gu

arXiv:2207.00733 · cs.CV · July 11, 2022

Open Access

TL;DR

This paper introduces COOKIE, a contrastive cross-modal pre-training method that enhances vision-language representations by combining a double-stream structure with novel modules for semantic alignment and knowledge sharing, improving retrieval performance and efficiency.

Contribution

COOKIE integrates a weight-sharing transformer and contrastive learning modules into a double-stream framework to improve both cross-modal and unimodal retrieval.

Findings

01 Outperforms existing models on cross-modal retrieval tasks.

02 Achieves higher evaluation metrics with improved computational efficiency.

03 Enhances unimodal representation learning through cross-modal knowledge sharing.

Abstract

Recently, cross-modal pre-training has become a hotspot because of its wide application in downstream research, including retrieval, captioning, and question answering. However, existing methods adopt a one-stream pre-training model to explore a unified vision-language representation for cross-modal retrieval, which easily suffers from a computational explosion. Moreover, although conventional double-stream structures are quite efficient, they still lack vital cross-modal interactions, resulting in low performance. Motivated by these challenges, we put forward Contrastive Cross-Modal Knowledge Sharing Pre-training (COOKIE) to learn joint text-image representations. Structurally, COOKIE adopts the traditional double-stream structure because of its acceptable time consumption. To overcome the inherent defects of double-stream structure as…
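To illustrate why a double-stream design keeps retrieval cheap, the sketch below encodes images and texts independently, compares them with a single similarity matrix, and scores the pairing with a generic symmetric InfoNCE loss. This is not COOKIE's actual objective or architecture: the toy embeddings, the function names, and the temperature of 0.07 are all assumptions chosen for illustration.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (guarding against a zero vector)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def similarity_matrix(image_embs, text_embs):
    """Cosine similarity between every image and every text embedding.

    In a double-stream model each item is encoded once, so comparing
    N images with M texts needs N + M encoder passes plus dot products,
    rather than N * M joint passes as in a one-stream model.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(i, t)) for t in txts] for i in imgs]

def _diagonal_cross_entropy(logits):
    """Mean of -log softmax(row)[i], treating pair (i, i) as the positive."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)

def symmetric_info_nce(sim, temperature=0.07):
    """Average of image-to-text and text-to-image contrastive losses.

    The temperature value is an assumed, commonly used default.
    """
    logits = [[s / temperature for s in row] for row in sim]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (_diagonal_cross_entropy(logits)
                  + _diagonal_cross_entropy(transposed))

def retrieve(sim, query_index):
    """Index of the text most similar to the given image."""
    row = sim[query_index]
    return max(range(len(row)), key=row.__getitem__)

# Hypothetical 2-D embeddings standing in for encoder outputs.
image_embs = [[1.0, 0.0], [0.0, 1.0]]
text_embs = [[0.9, 0.1], [0.2, 0.8]]

sim = similarity_matrix(image_embs, text_embs)
loss_matched = symmetric_info_nce(sim)
# Swapping the texts breaks the pairing, so the loss should rise.
loss_swapped = symmetric_info_nce(similarity_matrix(image_embs, text_embs[::-1]))
```

With correctly paired embeddings the loss is low and each image retrieves its own caption; swapping the captions raises the loss, which is the signal contrastive training uses to pull matched pairs together.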

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning

Methods: ALIGN