Hidden in the Multiplicative Interaction: Uncovering Fragility in Multimodal Contrastive Learning
Tillmann Rheude, Stefan Hegselmann, Roland Eils, Benjamin Wild

TL;DR
This paper identifies fragility in multimodal contrastive learning due to multiplicative interactions and proposes Gated Symile, a gating mechanism, to improve robustness and retrieval accuracy across multiple modalities.
Contribution
It introduces Gated Symile, a novel gating mechanism that enhances robustness in multimodal contrastive learning by mitigating the impact of unreliable or missing modalities.
Findings
Gated Symile outperforms state-of-the-art baselines on multiple datasets.
The gating mechanism effectively suppresses unreliable modality contributions.
The approach improves robustness in noisy, misaligned, or incomplete multimodal data.
Abstract
Contrastive learning has become a standard approach for unsupervised learning from paired data, as demonstrated by CLIP for image-text matching. However, many domains involve more than two modalities and require objectives that capture higher-order dependencies beyond pairwise alignment. Symile extends CLIP to this setting by replacing the dot product with the multilinear inner product (MIP) over modality embeddings. In this work, we show that a fragility is hidden in the multiplicative interaction: a single weakly informative, misaligned, or missing modality can propagate through the objective and distort cross-modal retrieval scores. We propose Gated Symile, a contrastive gating mechanism that adapts modality contributions on an attention-based, per-candidate basis. The gate suppresses unreliable inputs by interpolating embeddings toward learnable neutral directions with…
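The multiplicative fragility described in the abstract can be illustrated with a small NumPy sketch. It assumes the MIP is the sum over embedding dimensions of the elementwise product of all modality embeddings, which reduces to the dot product for two modalities. The helper names `mip` and `gate`, the scalar gate value, and the all-ones neutral vector are hypothetical choices made for illustration; the paper's gate is attention-based, per-candidate, and uses learnable neutral directions whose exact form is not given in the truncated abstract.

```python
import numpy as np

def mip(*embeddings):
    """Multilinear inner product: sum over dimensions of the elementwise product."""
    prod = np.ones_like(embeddings[0], dtype=float)
    for e in embeddings:
        prod = prod * e          # broadcasts over a batch of candidates
    return prod.sum(axis=-1)

def gate(embedding, g, neutral):
    """Hypothetical gate: interpolate an embedding toward a neutral direction.

    g = 1 keeps the embedding unchanged, g = 0 replaces it with `neutral`.
    """
    return g * embedding + (1.0 - g) * neutral

rng = np.random.default_rng(0)
d = 8
a = rng.normal(size=d)                 # query embedding, modality A
b = rng.normal(size=d)                 # query embedding, modality B
candidates = rng.normal(size=(5, d))   # candidate embeddings, modality C

scores_clean = mip(a, b, candidates)

# A weak modality (embedding shrunk toward zero) shrinks every score by the
# same factor, so the margins between candidates collapse.
scores_weak = mip(a, 1e-3 * b, candidates)

# A missing modality encoded as a zero vector wipes out all scores.
scores_missing = mip(a, np.zeros(d), candidates)

# Assumption for illustration only: with an all-ones neutral vector, fully
# gating out B reduces the three-way MIP to the pairwise score between A and C.
neutral = np.ones(d)
scores_gated = mip(a, gate(b, 0.0, neutral), candidates)
```

Under these assumptions, a single degraded factor rescales or zeroes the whole product, whereas gating toward a neutral direction lets the remaining modalities still determine the ranking.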
