Scaling Multimodal Pre-Training via Cross-Modality Gradient Harmonization
Junru Wu, Yi Liang, Feng Han, Hassan Akbari, Zhangyang Wang, Cong Yu

TL;DR
This paper addresses the challenge of noisy supervision in cross-modality contrastive learning by proposing gradient harmonization techniques, leading to improved multimodal pre-training performance and scalability.
Contribution
It introduces gradient realignment and curriculum learning methods to mitigate gradient conflicts, enhancing the effectiveness of large-scale multimodal pre-training.
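To make "gradient conflict" concrete, here is a minimal sketch in plain Python. It assumes two loss gradients conflict when their dot product is negative, and it realigns them with a PCGrad-style projection. This summary does not spell out the paper's exact realignment rule, so the projection is an assumption, and the function names (`project_out_conflict`, `harmonize`) and toy gradient values are hypothetical.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; a negative value signals conflict."""
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / norm if norm > 0 else 0.0


def project_out_conflict(g, ref):
    """Assumed PCGrad-style rule: if g conflicts with ref (negative dot
    product), remove the component of g that lies along ref."""
    d = dot(g, ref)
    ref_sq = dot(ref, ref)
    if d >= 0 or ref_sq == 0:
        return list(g)  # no conflict, keep g unchanged
    scale = d / ref_sq
    return [a - scale * b for a, b in zip(g, ref)]


def harmonize(g_a, g_b):
    """Project each gradient against the other, then sum the results."""
    new_a = project_out_conflict(g_a, g_b)
    new_b = project_out_conflict(g_b, g_a)
    return [x + y for x, y in zip(new_a, new_b)]


# Toy gradients from two cross-modality losses (hypothetical values).
g_video_audio = [1.0, 0.0]
g_video_text = [-1.0, 1.0]

print(cosine_similarity(g_video_audio, g_video_text))  # negative: the two conflict
print(harmonize(g_video_audio, g_video_text))  # [0.5, 1.5]
```

After projection, neither adjusted gradient has a negative component along the other original gradient, so the combined update no longer pushes one objective backward to serve the other.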
Findings
Improved downstream task performance with gradient harmonization.
Scalability to larger, more heterogeneous datasets.
Reduction of gradient conflicts during training.
Abstract
Self-supervised pre-training has recently demonstrated success on large-scale multimodal data, and state-of-the-art contrastive learning methods often enforce feature consistency across cross-modality inputs, such as video/audio or video/text pairs. Although convenient to formulate and leverage in practice, such cross-modality alignment (CMA) is only a weak and noisy form of supervision, since two modalities can be semantically misaligned even if they are temporally aligned. For example, even in commonly used instructional videos, a speaker can sometimes refer to something that is not visually present in the current frame, and the semantic misalignment is likely to be even more unpredictable for raw videos from the internet. We conjecture that this might cause conflicts and biases among modalities, and may hence prevent CMA from scaling up to training with larger and more heterogeneous data.…
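As an illustration of why temporally aligned but semantically misaligned pairs act as noisy supervision, the sketch below computes a contrastive loss over paired embeddings in plain Python. It assumes an InfoNCE-style objective with in-batch negatives; the abstract does not state the exact loss used, and the function name `info_nce`, the temperature, and the toy embeddings are all hypothetical.

```python
import math


def normalize(v):
    """Scale v to unit length (zero vectors are returned unchanged)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else list(v)


def info_nce(anchors, positives, temperature=0.1):
    """Assumed InfoNCE-style loss: the i-th anchor should score highest
    against the i-th positive; the other positives in the batch serve
    as negatives."""
    a = [normalize(v) for v in anchors]
    p = [normalize(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(a)


# Toy embeddings: two video clips and their paired text (hypothetical values).
video = [[1.0, 0.0], [0.0, 1.0]]
text_aligned = [[1.0, 0.0], [0.0, 1.0]]     # narration matches the frames
text_misaligned = [[0.0, 1.0], [1.0, 0.0]]  # narration refers to the other clip

print(info_nce(video, text_aligned))     # close to zero
print(info_nce(video, text_misaligned))  # much larger
```

When the paired text describes something other than what is on screen, the loss pulls together embeddings that should stay apart, which is the kind of noisy signal the abstract attributes to CMA.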
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Cancer-related molecular mechanisms research
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Softmax · Dense Connections · Layer Normalization · Residual Connection · Vision Transformer · Contrastive Learning · VATT
