Aligning Multimodal Representations through an Information Bottleneck

Antonio Almudévar; José Miguel Hernández-Lobato; Sameer Khurana; Ricard Marxer; Alfonso Ortega

arXiv:2506.04870 · cs.LG · June 6, 2025

TL;DR

This paper investigates why contrastive losses often fail to produce aligned multimodal representations, attributing the issue to modality-specific information, and proposes an information bottleneck-based regularization to improve alignment.

Contribution

It provides a theoretical analysis of modality-specific information in contrastive learning and introduces a novel regularization method to enhance multimodal representation alignment.
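For orientation, the classical Information Bottleneck objective can be written as below. This is the standard textbook form, not necessarily the formulation used in the paper. The symbols $X$ (input), $Y$ (target), $Z$ (representation), and $\beta$ (trade-off weight) are assumptions introduced here for illustration.

```latex
% Classical Information Bottleneck Lagrangian (generic form; symbols are illustrative).
% Z is a stochastic representation of X: compress X while keeping information about Y.
\min_{p(z \mid x)} \; I(X; Z) \;-\; \beta \, I(Z; Y), \qquad \beta > 0
```

One possible reading in the multimodal setting, stated as an assumption: if $X$ is the input of one modality and $Y$ is its paired sample from the other modality, then the $I(X;Z)$ term penalizes information in $Z$ that is not needed to predict the other modality, which is the kind of modality-specific information the summary refers to.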

Findings

01 Regularization improves representation alignment in experiments.

02 Hyperparameter tuning affects the emergence of modality-specific information.

03 The proposed method outperforms baseline contrastive approaches.

Abstract

Contrastive losses have been extensively used as a tool for multimodal representation learning. However, it has been empirically observed that their use is not effective to learn an aligned representation space. In this paper, we argue that this phenomenon is caused by the presence of modality-specific information in the representation space. Although some of the most widely used contrastive losses maximize the mutual information between representations of both modalities, they are not designed to remove the modality-specific information. We give a theoretical description of this problem through the lens of the Information Bottleneck Principle. We also empirically analyze how different hyperparameters affect the emergence of this phenomenon in a controlled experimental setup. Finally, we propose a regularization term in the loss function that is derived by means of a variational…
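As a point of reference for the contrastive losses the abstract discusses, here is a minimal NumPy sketch of a symmetric InfoNCE loss of the kind commonly used for paired multimodal embeddings. It is a generic illustration, not the paper's implementation, and it does not include the proposed regularization term, whose form is not given in the text above. The function names, the temperature default of 0.07, and the array shapes are assumptions chosen for the example.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit Euclidean norm."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_info_nce(z_a, z_b, temperature=0.07):
    """Symmetric InfoNCE loss for a batch of paired embeddings.

    z_a, z_b: arrays of shape (batch, dim); row i of z_a is paired
    with row i of z_b (hypothetical modality A and modality B).
    The temperature value is an assumed default, not from the paper.
    """
    z_a = l2_normalize(z_a)
    z_b = l2_normalize(z_b)
    # Cosine similarity between every A sample and every B sample.
    logits = z_a @ z_b.T / temperature
    idx = np.arange(z_a.shape[0])
    # A -> B: each row should pick out its own pair among all B samples.
    loss_ab = -log_softmax(logits, axis=1)[idx, idx].mean()
    # B -> A: each column should pick out its own pair among all A samples.
    loss_ba = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_ab + loss_ba)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    z = rng.normal(size=(8, 16))
    print("matched pairs:   ", symmetric_info_nce(z, z))
    print("mismatched pairs:", symmetric_info_nce(z, z[::-1]))
```

The loss only rewards each sample for being closer to its own pair than to the other samples in the batch. Nothing in it penalizes extra information that a single modality carries, which is consistent with the abstract's point that such losses are not designed to remove modality-specific information.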

Videos

Aligning Multimodal Representations through an Information Bottleneck · SlidesLive

Taxonomy

Topics: Face Recognition and Perception · Child and Animal Learning Development · Domain Adaptation and Few-Shot Learning

Methods: Sparse Evolutionary Training