Orthogonalized Multimodal Contrastive Learning with Asymmetric Masking for Structured Representations
Carolin Cissee, Raneen Younis, Zahra Ahmadi

TL;DR
COrAL is a novel multimodal contrastive learning framework that explicitly disentangles shared, unique, and synergistic information using orthogonality constraints and asymmetric masking, leading to more stable and comprehensive representations.
Contribution
The paper introduces COrAL, a framework that explicitly models the redundant (shared), unique, and synergistic information components in multimodal data, improving representation quality and stability over existing methods.
Findings
Outperforms state-of-the-art methods on synthetic and real datasets
Achieves lower variance in performance across runs
Produces more stable and comprehensive multimodal embeddings
Abstract
Multimodal learning seeks to integrate information from heterogeneous sources, where signals may be shared across modalities, specific to individual modalities, or emerge only through their interaction. While self-supervised multimodal contrastive learning has achieved remarkable progress, most existing methods predominantly capture redundant cross-modal signals, often neglecting modality-specific (unique) and interaction-driven (synergistic) information. Recent extensions broaden this perspective, yet they either fail to explicitly model synergistic interactions or learn different information components in an entangled manner, leading to incomplete representations and potential information leakage. We introduce COrAL, a principled framework that explicitly and simultaneously preserves redundant, unique, and synergistic information within multimodal representations. COrAL…
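The TL;DR mentions orthogonality constraints between information components, but this summary does not give their exact form. As a rough illustration only, one common way to discourage overlap between two embedding blocks (for example a "shared" and a "unique" embedding of the same sample) is to penalize their squared cosine similarity. The function name `orthogonality_penalty`, the NumPy implementation, and the toy vectors below are assumptions made for this sketch and are not taken from the paper.

```python
import numpy as np

def orthogonality_penalty(z_a, z_b, eps=1e-8):
    """Mean squared cosine similarity between paired rows of z_a and z_b.

    Hypothetical helper, not the paper's loss: it is 0 when every pair of
    rows is orthogonal and approaches 1 when every pair is collinear.
    """
    z_a = np.asarray(z_a, dtype=float)
    z_b = np.asarray(z_b, dtype=float)
    if z_a.shape != z_b.shape:
        raise ValueError("z_a and z_b must have the same shape")
    # L2-normalize each row so the dot product is a cosine similarity.
    a = z_a / (np.linalg.norm(z_a, axis=1, keepdims=True) + eps)
    b = z_b / (np.linalg.norm(z_b, axis=1, keepdims=True) + eps)
    cos = np.sum(a * b, axis=1)
    return float(np.mean(cos ** 2))

# Toy embeddings (assumed values): each "unique" row is orthogonal to the
# matching "shared" row.
shared = np.array([[1.0, 0.0], [0.0, 2.0]])
unique = np.array([[0.0, 3.0], [1.0, 0.0]])

print(orthogonality_penalty(shared, unique))  # orthogonal pairs: no penalty
print(orthogonality_penalty(shared, shared))  # identical pairs: close to 1
```

In a training loop, a term like this would typically be added to the contrastive objective with a weighting coefficient; how COrAL actually combines its constraints with asymmetric masking is not described in the text available here.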
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications
