The More, the Merrier: Contrastive Fusion for Higher-Order Multimodal Alignment

Stefanos Koutoupis; Michaela Areti Zervou; Konstantinos Kontras; Maarten De Vos; Panagiotis Tsakalides; Grigorios Tsagkatakis

arXiv:2511.21331·cs.CV·April 6, 2026

TL;DR

The paper introduces Contrastive Fusion (ConFu), a novel framework for higher-order multimodal alignment that jointly embeds individual modalities and their fused combinations to better capture complex dependencies.

Contribution

ConFu extends contrastive learning to include fused modality representations, enabling higher-order dependency modeling while maintaining pairwise relationships.
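The shape of such an objective can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: it assumes an InfoNCE-style loss for each modality pair plus one extra term that aligns a fused pair with the third modality. The names `fuse`, `confu_style_loss`, `fused_weight`, and `temperature` are hypothetical, and the element-wise mean used for fusion is only a stand-in, since this summary does not describe the paper's fusion module.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def info_nce(anchors, targets, temperature=0.1):
    """One-directional InfoNCE: anchors[i] should match targets[i]."""
    anchors = [normalize(v) for v in anchors]
    targets = [normalize(v) for v in targets]
    total = 0.0
    for i, a in enumerate(anchors):
        logits = [sum(x * y for x, y in zip(a, t)) / temperature for t in targets]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)

def symmetric_info_nce(u, v, temperature=0.1):
    """Average of the two retrieval directions."""
    return 0.5 * (info_nce(u, v, temperature) + info_nce(v, u, temperature))

def fuse(u, v):
    """Hypothetical fusion: element-wise mean (an assumption, not the paper's module)."""
    return [[(x + y) / 2 for x, y in zip(a, b)] for a, b in zip(u, v)]

def confu_style_loss(z1, z2, z3, fused_weight=1.0, temperature=0.1):
    """Pairwise contrastive terms plus one fused-pair-to-third-modality term."""
    pairwise = (
        symmetric_info_nce(z1, z2, temperature)
        + symmetric_info_nce(z1, z3, temperature)
        + symmetric_info_nce(z2, z3, temperature)
    )
    fused = symmetric_info_nce(fuse(z1, z2), z3, temperature)
    return pairwise + fused_weight * fused

# Toy batch of two samples with 2-D embeddings per modality.
batch = [[1.0, 0.0], [0.0, 1.0]]
aligned = confu_style_loss(batch, batch, batch)
mismatched = confu_style_loss(batch, batch, batch[::-1])
```

On this toy batch the loss is near zero when all three modalities agree and grows when the third modality is paired with the wrong samples, which is the behavior a contrastive objective of this kind is meant to have.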

Findings

01  ConFu captures higher-order dependencies, such as XOR relationships.

02  It demonstrates competitive performance on retrieval and classification tasks.

03  It supports unified one-to-one and two-to-one retrieval within a single framework.
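To see why XOR is a useful probe for higher-order dependencies, consider the toy case below. It is an illustrative sketch, not an experiment from the paper: two binary variables `a` and `b` stand in for two modalities, and `c = a XOR b` stands in for a third. Each of `a` and `b` alone has zero covariance with `c`, yet the pair determines `c` exactly, so a purely pairwise objective has no signal to exploit.

```python
from itertools import product

# All four equally likely (a, b) pairs, with c defined as a XOR b.
samples = [(a, b, a ^ b) for a, b in product([0, 1], repeat=2)]

def covariance(xs, ys):
    """Population covariance of two equally weighted sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

a_vals, b_vals, c_vals = (list(col) for col in zip(*samples))

# Each single variable is uncorrelated with the target...
cov_ac = covariance(a_vals, c_vals)
cov_bc = covariance(b_vals, c_vals)

# ...but the pair (a, b) maps to exactly one value of c.
pair_to_c = {}
for a, b, c in samples:
    pair_to_c.setdefault((a, b), set()).add(c)
pair_determines_c = all(len(values) == 1 for values in pair_to_c.values())

print(cov_ac, cov_bc, pair_determines_c)
```

Because the variables are binary, zero covariance here also means statistical independence, so the dependency only becomes visible when `a` and `b` are considered jointly.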

Abstract

Learning joint representations across multiple modalities remains a central challenge in multimodal machine learning. Prevailing approaches predominantly operate in pairwise settings, aligning two modalities at a time. While some recent methods aim to capture higher-order interactions among multiple modalities, they often overlook or insufficiently preserve pairwise relationships, limiting their effectiveness on single-modality tasks. In this work, we introduce Contrastive Fusion (ConFu), a framework that jointly embeds both individual modalities and their fused combinations into a unified representation space, where modalities and their fused counterparts are aligned. ConFu extends traditional pairwise contrastive objectives with an additional fused-modality contrastive term, encouraging the joint embedding of modality pairs with a third modality. This formulation enables ConFu to…
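Once single and fused embeddings share one space, one-to-one and two-to-one retrieval reduce to the same nearest-neighbor lookup. The sketch below illustrates that idea under stated assumptions: the embeddings are hand-picked toy vectors, `retrieve` and `fuse` are hypothetical helper names, and the element-wise mean again stands in for whatever learned fusion the paper uses.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def retrieve(query, gallery):
    """Return the index of the gallery embedding most similar to the query."""
    scores = [cosine(query, g) for g in gallery]
    return max(range(len(gallery)), key=scores.__getitem__)

def fuse(u, v):
    """Hypothetical fusion: element-wise mean (an assumption, not the paper's module)."""
    return [(x + y) / 2 for x, y in zip(u, v)]

# Toy gallery of target-modality embeddings in the shared space.
gallery = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query_a = [0.9, 0.1]  # embedding from one query modality
query_b = [0.2, 0.8]  # embedding from a second query modality

one_to_one = retrieve(query_a, gallery)                 # single-modality query
two_to_one = retrieve(fuse(query_a, query_b), gallery)  # fused two-modality query
```

In this toy setup the single-modality query lands on gallery item 0, while the fused query lands on item 2, the entry that reflects both inputs. The same `retrieve` function serves both cases.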

Code & Models

Repositories

estafons/confu (GitHub)