Gramian Multimodal Representation Learning and Alignment
Giordano Cicchetti, Eleonora Grassucci, Luigi Sigillo, Danilo Comminiello

TL;DR
This paper introduces GRAM, a novel geometric measure for aligning multiple modalities in a shared space, improving multimodal understanding and achieving state-of-the-art results in retrieval and classification tasks.
Contribution
The paper proposes GRAM, a new geometric alignment measure that directly aligns multiple modalities in a high-dimensional space, overcoming limitations of pairwise approaches.
Findings
GRAM improves multimodal alignment quality.
State-of-the-art performance in retrieval and classification tasks.
Flexible for any number of modalities from 2 to n.
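To make the "2 to n" claim concrete, the following is a short worked derivation of the two-modality case. It relies on the reviewers' description of GRAM as the volume of the parallelotope spanned by the modality vectors, obtained from the determinant of their Gram matrix. The notation (A, G, a_i, θ) and the unit-norm assumption are introduced here for illustration and are not taken from the paper.

```latex
% Assumption: a_1, ..., a_k are unit-norm modality embeddings, stacked as the columns of A.
G = A^{\top} A, \qquad G_{ij} = \langle a_i, a_j \rangle, \qquad
\mathrm{Vol}(a_1, \dots, a_k) = \sqrt{\det G}

% Two modalities (k = 2), with \theta the angle between a_1 and a_2:
G = \begin{pmatrix} 1 & \cos\theta \\ \cos\theta & 1 \end{pmatrix}, \qquad
\mathrm{Vol}(a_1, a_2) = \sqrt{1 - \cos^{2}\theta} = |\sin\theta|
```

Under these assumptions the two-modality volume is a function of the cosine similarity alone and vanishes when the two vectors are collinear. The same determinant formula applies unchanged for k > 2, with a smaller volume indicating more closely aligned vectors.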
Abstract
Human perception integrates multiple modalities, such as vision, hearing, and language, into a unified understanding of the surrounding reality. While recent multimodal models have achieved significant progress by aligning pairs of modalities via contrastive learning, their solutions are unsuitable when scaling to multiple modalities. These models typically align each modality to a designated anchor without ensuring the alignment of all modalities with each other, leading to suboptimal performance in tasks requiring a joint understanding of multiple modalities. In this paper, we structurally rethink the pairwise conventional approach to multimodal learning and we present the novel Gramian Representation Alignment Measure (GRAM), which overcomes the above-mentioned limitations. GRAM learns and then aligns modalities directly in the higher-dimensional space in which modality…
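As a rough illustration of the measure, the sketch below computes the volume of the parallelotope spanned by a set of modality embeddings as the square root of the determinant of their Gram matrix, which is how the reviewers describe GRAM. It is a minimal, dependency-free sketch rather than the authors' implementation: the function names (`gram_matrix`, `determinant`, `gram_volume`), the L2 normalization step, and the singularity threshold are assumptions made for this example, and the contrastive loss built on top of the measure is not covered.

```python
import math


def gram_matrix(vectors):
    """Gram matrix of the L2-normalized input vectors: G[i][j] = <u_i, u_j>.

    Assumption: each modality embedding is normalized to unit length first.
    """
    unit = []
    for v in vectors:
        norm = math.sqrt(sum(x * x for x in v))
        unit.append([x / norm for x in v])
    return [[sum(a * b for a, b in zip(u, w)) for w in unit] for u in unit]


def determinant(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]  # work on a copy
    n = len(m)
    det = 1.0
    for col in range(n):
        # Pick the row with the largest absolute entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            return 0.0  # treated as singular (illustrative threshold)
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det  # a row swap flips the sign
        det *= m[col][col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for j in range(col, n):
                m[row][j] -= factor * m[col][j]
    return det


def gram_volume(vectors):
    """Volume of the parallelotope spanned by the normalized vectors."""
    det = determinant(gram_matrix(vectors))
    # Clamp tiny negative values caused by floating-point error.
    return math.sqrt(max(det, 0.0))


# Hypothetical 3-dimensional embeddings for three modalities.
orthogonal = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
collinear = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]]
print(gram_volume(orthogonal))
print(gram_volume(collinear))
```

With unit-norm inputs the volume lies between 0 and 1: mutually orthogonal embeddings give the maximum, and embeddings that point in the same direction give 0, so a smaller value indicates closer alignment. Because the Gram matrix is only k × k for k modalities, the determinant step stays cheap when the number of modalities is small.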
Peer Reviews
Decision·ICLR 2025 Poster
1. The paper demonstrates a novel method to better align heterogeneous modalities in higher-dimensional geometric spaces, which takes the semantic relations between multiple modalities into consideration. 2. It demonstrates that the proposed Gramian volume can be utilized as a novel metric to evaluate the alignment performance of a multimodal model. 3. The model outperforms other state-of-the-art methods. 4. The use of figures and tables in this paper to illustr
1. The comparison of the confusion matrices of the cosine-based approach and the proposed method should be presented in the main paper instead of in the supplementary materials. Also, to demonstrate the effectiveness of the proposed GRAM Multimodal Contrastive loss, it is important to conduct more ablation studies comparing it against the vanilla contrastive loss. It is reasonable to conduct ablation studies including: (1) without the proposed L_D2A and L_A2D; (2) without L_D2A, with L_A2D; (3) without L_A2D a
1. Innovative Methodology: The introduction of GRAM as a higher-dimensional geometric measure provides a new approach to aligning multiple modalities simultaneously, addressing the limitations of traditional pairwise methods. 2. Strong Empirical Results: The paper demonstrates substantial performance gains across multiple multimodal benchmarks, including video-audio-text retrieval and classification tasks, showcasing the effectiveness of the GRAM-based approach.
1. Does the computational complexity of the GRAM volume increase significantly as the number of modalities increases? Especially in high-dimensional spaces, does this calculation affect the real-time performance and scalability of the model? Has any consideration been given to approximating or optimizing the computation to reduce the computational burden? 2. GRAM uses the volume of the k-dimensional parallelotope built from the multimodal vectors as the alignment metric. For sparse or dense distributions
1. The idea of using the determinant of the Gram matrix of the representation vectors of all modalities to measure their similarity is novel. Besides, the computational cost of the similarity calculation is low. 2. The proposed method provides a new insight into the representation geometry of multimodal learning.
1. The GRAM model in the experiment section is obtained from VAST pretrained models, which makes it difficult to verify that the Gramian measure is superior to cosine similarity in aligning modalities. The reviewer suggests the authors train the model from scratch for a fair comparison. 2. Considering that this Gramian measure is new for modality alignment, the experimental analysis of it is not sufficient. The reviewer expects to observe how the Gramian similarity varies during training or fine-tuning, and c
Taxonomy
Topics: Natural Language Processing Techniques · Speech and dialogue systems · Topic Modeling
Methods: ALIGN
