CMAL: A Novel Cross-Modal Associative Learning Framework for Vision-Language Pre-Training

Zhiyuan Ma; Jianjun Li; Guohui Li; Kaiyan Huang

arXiv:2410.12595·cs.CV·October 17, 2024

TL;DR

CMAL introduces a cross-modal associative learning framework for vision-language pre-training that leverages anchor points and self-supervised mapping to improve performance with less data, outperforming contrastive methods.

Contribution

The paper proposes a novel CMAL framework that incorporates anchor point detection and associative learning, addressing limitations of contrastive learning in vision-language pre-training.

Findings

01 Achieves competitive results on four downstream tasks.

02 Sets new state-of-the-art on SNLI-VE and REC datasets.

03 Requires significantly less training data than previous methods.

Abstract

With the flourishing of social media platforms, vision-language pre-training (VLP) has recently received great attention, and remarkable progress has been achieved. The success of VLP largely benefits from the complementation and enhancement of information between modalities. However, most recent studies focus on cross-modal contrastive learning (CMCL), which promotes image-text alignment by pulling the embeddings of positive sample pairs together while pushing those of negative pairs apart. This approach ignores the natural asymmetry between modalities and requires a large-scale image-text corpus to make hard-won progress. To mitigate this predicament, we propose CMAL, a Cross-Modal Associative Learning framework with anchor point detection and cross-modal associative learning for VLP. Specifically, we first embed visual objects and textual tokens, respectively, into…
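For readers unfamiliar with the CMCL baseline that the abstract contrasts against, the following is a minimal sketch of a symmetric contrastive (InfoNCE-style) loss over a batch of paired image and text embeddings. It is not the CMAL method and is not taken from the paper; the function name `contrastive_loss`, the helper names, and the temperature value 0.07 are assumptions chosen for illustration.

```python
import math

def _normalize(v):
    # Scale a vector to unit length so dot products are cosine similarities.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def _cross_entropy_diag(logits):
    # Mean cross-entropy where the correct class for row i is column i,
    # i.e. the i-th image is paired with the i-th text.
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    # Hypothetical helper: symmetric image-to-text and text-to-image loss.
    # Positive pairs (same index) are pulled together; every other
    # pairing in the batch acts as a negative and is pushed apart.
    imgs = [_normalize(v) for v in image_embs]
    txts = [_normalize(v) for v in text_embs]
    sim = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
           for i in imgs]
    sim_t = [list(col) for col in zip(*sim)]  # text-to-image direction
    return 0.5 * (_cross_entropy_diag(sim) + _cross_entropy_diag(sim_t))

# Toy batch: correctly paired embeddings give a much lower loss
# than the same embeddings with the pairing swapped.
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

Note that this loss treats both modalities identically, which is the symmetry the abstract argues is a poor fit for image-text data.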


Taxonomy

Topics: Robotics and Automated Systems · Speech and dialogue systems

Methods: Softmax · Attention Is All You Need · Focus · Contrastive Learning