CMAL: A Novel Cross-Modal Associative Learning Framework for Vision-Language Pre-Training
Zhiyuan Ma, Jianjun Li, Guohui Li, Kaiyan Huang

TL;DR
CMAL introduces a cross-modal associative learning framework for vision-language pre-training that leverages anchor points and self-supervised mapping to improve performance with less data, outperforming contrastive methods.
Contribution
The paper proposes a novel CMAL framework that incorporates anchor point detection and associative learning, addressing limitations of contrastive learning in vision-language pre-training.
Findings
Achieves competitive results on four downstream tasks.
Sets a new state of the art on SNLI-VE and on referring expression comprehension (REC).
Requires significantly less training data than previous methods.
Abstract
With the flourishing of social media platforms, vision-language pre-training (VLP) has recently received great attention, and remarkable progress has been achieved. The success of VLP largely benefits from the way different modalities complement and enhance one another. However, most recent studies focus on cross-modal contrastive learning (CMCL), which promotes image-text alignment by pulling the embeddings of positive sample pairs together while pushing those of negative pairs apart. This approach ignores the natural asymmetry between modalities and requires a large-scale image-text corpus to achieve hard-won progress. To mitigate this predicament, we propose CMAL, a Cross-Modal Associative Learning framework with anchor point detection and cross-modal associative learning for VLP. Specifically, we first respectively embed visual objects and textual tokens into…
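The contrastive objective that the abstract contrasts CMAL against can be illustrated with a minimal sketch. This is not CMAL's method and is not taken from the paper; it is a generic symmetric InfoNCE-style loss, and the function names, the toy embeddings, and the temperature value of 0.07 are all assumptions chosen for illustration.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (assumes a non-zero vector)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(queries, keys, temperature=0.07):
    """Mean cross-entropy of matching queries[i] to keys[i] among all keys.

    The i-th key is the positive for the i-th query; every other key in
    the batch acts as a negative. The temperature is an assumed value.
    """
    q = [l2_normalize(v) for v in queries]
    k = [l2_normalize(v) for v in keys]
    total = 0.0
    for i, qi in enumerate(q):
        # Cosine similarity of this query to every key, scaled by temperature.
        logits = [sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability assigned to the positive pair.
        total += log_denom - logits[i]
    return total / len(q)


def symmetric_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Average of the image-to-text and text-to-image InfoNCE terms."""
    return 0.5 * (
        info_nce(image_emb, text_emb, temperature)
        + info_nce(text_emb, image_emb, temperature)
    )


# Toy batch of two image-text pairs (hypothetical 2-D embeddings).
image_emb = [[1.0, 0.0], [0.0, 1.0]]
aligned_text = [[0.9, 0.1], [0.1, 0.9]]   # each text is close to its own image
swapped_text = [[0.1, 0.9], [0.9, 0.1]]   # each text is close to the other image

loss_aligned = symmetric_contrastive_loss(image_emb, aligned_text)
loss_swapped = symmetric_contrastive_loss(image_emb, swapped_text)
```

In this toy batch the aligned pairing yields a lower loss than the swapped pairing, which is the "pull positives together, push negatives apart" behaviour described above. The loss treats both modalities identically, which is the symmetry the abstract argues against.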
Taxonomy
Topics: Robotics and Automated Systems · Speech and dialogue systems
Methods: Softmax · Attention Is All You Need · Focus · Contrastive Learning
