What to align in multimodal contrastive learning?
Benoit Dufumier, Javiera Castillo-Navarro, Devis Tuia, Jean-Philippe Thiran

TL;DR
This paper introduces CoMM, a novel contrastive learning method for multimodal data that captures redundant, unique, and synergistic information, surpassing traditional shared-representation approaches and achieving state-of-the-art results on multiple benchmarks.
Contribution
CoMM models multimodal interactions by maximizing the mutual information between augmented versions of multimodal features, which yields representations that capture more than the information shared across modalities.
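The idea of maximizing mutual information between two augmented versions of a fused multimodal representation can be illustrated with a small NumPy sketch. This is not the authors' implementation: the InfoNCE loss is used here as one common contrastive lower bound on mutual information, and `augment` (Gaussian noise), `fuse` (concatenation followed by a fixed random projection, standing in for a learned fusion module), the temperature, and all shapes are hypothetical choices made only for illustration.

```python
import numpy as np


def info_nce(z_a, z_b, temperature=0.1):
    """Symmetric InfoNCE loss between two (n, d) batches of embeddings.

    Row i of z_a and row i of z_b form a positive pair; all other rows
    in the batch act as negatives.
    """
    # L2-normalise so that dot products are cosine similarities.
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature  # (n, n), positives on the diagonal

    def diag_cross_entropy(l):
        # Numerically stable log-softmax over each row.
        l = l - l.max(axis=1, keepdims=True)
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    return 0.5 * (diag_cross_entropy(logits) + diag_cross_entropy(logits.T))


def augment(x, rng, noise=0.1):
    """Hypothetical augmentation: additive Gaussian noise."""
    return x + noise * rng.standard_normal(x.shape)


def fuse(x1, x2, w):
    """Hypothetical fusion: concatenate both modalities, then project."""
    return np.tanh(np.concatenate([x1, x2], axis=1) @ w)


rng = np.random.default_rng(0)
x1 = rng.standard_normal((8, 16))  # toy batch, modality 1
x2 = rng.standard_normal((8, 12))  # toy batch, modality 2
w = rng.standard_normal((28, 32)) / np.sqrt(28)  # fixed random projection

# Two independently augmented views of the same multimodal inputs.
z_a = fuse(augment(x1, rng), augment(x2, rng), w)
z_b = fuse(augment(x1, rng), augment(x2, rng), w)

loss = info_nce(z_a, z_b)
# Mismatched pairing for comparison: positives no longer line up.
shuffled = info_nce(z_a, np.roll(z_b, 1, axis=0))
print(f"aligned loss: {loss:.4f}, shuffled loss: {shuffled:.4f}")
```

In this sketch the contrast is applied to the fused representation rather than between per-modality embeddings, which is the distinction the contribution statement draws. In a real system the projection would be trained by minimizing the loss rather than fixed.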
Findings
CoMM effectively captures redundant, unique, and synergistic multimodal information.
It achieves state-of-the-art performance on seven multimodal benchmarks.
Theoretical analysis shows that the redundant, unique, and synergistic information terms emerge from the formulation.
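For readers unfamiliar with these three terms, they are usually defined through partial information decomposition. The block below is a sketch of the standard two-modality decomposition, not an equation taken from this summary; the symbols $X_1$, $X_2$ (modalities), $Y$ (task), $R$ (redundancy), $U_1$, $U_2$ (uniqueness) and $S$ (synergy) are assumed notation.

```latex
\begin{align}
I(X_1, X_2; Y) &= R + U_1 + U_2 + S, \\
I(X_1; Y) &= R + U_1, \\
I(X_2; Y) &= R + U_2.
\end{align}
```

Under this decomposition, synergy $S$ appears only in the joint term $I(X_1, X_2; Y)$ and in neither single-modality term, so it cannot be recovered from either modality alone.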
Abstract
Humans perceive the world through multisensory integration, blending the information of different modalities to adapt their behavior. Contrastive learning offers an appealing solution for multimodal self-supervised learning. Indeed, by considering each modality as a different view of the same entity, it learns to align features of different modalities in a shared representation space. However, this approach is intrinsically limited as it only learns shared or redundant information between modalities, while multimodal interactions can arise in other ways. In this work, we introduce CoMM, a Contrastive MultiModal learning strategy that enables the communication between modalities in a single multimodal space. Instead of imposing cross- or intra- modality constraints, we propose to align multimodal representations by maximizing the mutual information between augmented versions of these…
Peer Reviews
Decision: ICLR 2025 Poster
This paper grounds multimodal representation learning in the theoretical framework of multimodal information theory. Lemmas 2 and 3 offer insight into novel methods that can be applied to learn multimodal representations that go beyond redundancy.
I have strong doubts about the soundness of the theoretical foundation of this paper: 1. Assumption 1 assumes the existence of a multimodal augmentation $t$ such that $I(X; t(X)) = I(X; Y)$. This means that if $X$ contains more information than the label $Y$, $t(X)$ has to "reduce" the information in $X$ down to $Y$. This is different from just "label-preserving multimodal augmentation". The experiments section also offers little insight about how exactly this augmentation is carried out. The au
1. **Novel multimodal interaction learning under a contrastive learning framework**: the proposed CoMM framework is both theoretically and empirically shown to be able to capture uniqueness and (especially) synergy better than prior work. 2. **Comprehensive evaluation and careful ablation**: the proposed method is evaluated on a wide range of multimodal benchmarks involving diverse tasks, each requiring a different level of interaction modeling. The paper conducts careful ablation in analyzi
1. **Need for clarification about the Multi-view redundancy assumption**: the paper introduces the Multi-view redundancy assumption (Definition 1) to highlight the insufficiency of several existing works that propose cross-modal contrastive learning. Specifically, the paper shows that under this assumption, which states that "most task-relevant information is shared across modalities", cross-modal contrastive learning is primarily learning the redundancy interaction while ignoring the others int
- **Originality**: - As highlighted by the authors, the use of self-supervision contrastive objectives to learn multimodal representations is not particularly novel (see, for example [1-4] for examples in fully-training and fine-tuning settings). However, the authors focus on the problem of learning not only redundant interactions between the modalities, but also the unique and synergetic interactions. This work seems to follow in line with previous work (FactorCL [6]), extending it to account
My two main concerns with the current version of the work are the following: - From an architectural point-of-view, the work presents little novelty: the multimodal fusion mechanism, based on concatenation and a transformer block, is also not particularly novel (see, for example, [1], where the authors discuss several works that employ concatenation for multimodal inputs in transformer architectures). Also, despite introducing some novel ideas, the work is still a continuation of FactorCL [2], w
Taxonomy
Topics: EFL/ESL Teaching and Learning · Second Language Learning and Teaching · Second Language Acquisition and Learning
Methods: Contrastive Learning · ALIGN
