CSA: Data-efficient Mapping of Unimodal Features to Multimodal Features
Po-han Li, Sandeep P. Chinchali, Ufuk Topcu

TL;DR
CSA is a novel method that efficiently maps unimodal features to multimodal space using limited data, outperforming existing methods without extensive training.
Contribution
Introduces CSA, a data-efficient approach for multimodal mapping that leverages unimodal encoders and matrix decomposition, reducing training data and computational requirements.
Findings
CSA outperforms CLIP with 50,000x fewer multimodal data pairs
CSA surpasses state-of-the-art multimodal mapping methods
Effective for modalities beyond image and text
Abstract
Multimodal encoders like CLIP excel in tasks such as zero-shot image classification and cross-modal retrieval. However, they require excessive training data. We propose canonical similarity analysis (CSA), which uses two unimodal encoders to replicate multimodal encoders using limited data. CSA maps unimodal features into a multimodal space, using a new similarity score to retain only the multimodal information. CSA only involves the inference of unimodal encoders and a cubic-complexity matrix decomposition, eliminating the need for extensive GPU-based model training. Experiments show that CSA outperforms CLIP while requiring fewer multimodal data pairs to bridge the modalities given pre-trained unimodal encoders on ImageNet classification and misinformative news caption detection. CSA surpasses the state-of-the-art method to map unimodal features to multimodal features.…
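The abstract describes the pipeline only in words: run two frozen unimodal encoders, solve a matrix decomposition, and score pairs with a similarity restricted to the shared information. A minimal sketch of that idea follows, assuming classical canonical correlation analysis (whitening plus an SVD) as the decomposition and a correlation-weighted cosine over the top `s` components as the score. The exact formulation in the paper may differ; `fit_cca`, `canonical_similarity`, and `reg` are hypothetical names introduced for illustration, and random arrays stand in for real encoder features.

```python
import numpy as np


def _inv_sqrt(mat):
    # Inverse matrix square root of a symmetric positive-definite matrix.
    vals, vecs = np.linalg.eigh(mat)
    return vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T


def fit_cca(x, y, reg=1e-3):
    """Fit classical CCA on paired features x (n, dx) and y (n, dy).

    `reg` is an assumed ridge term that keeps the covariances invertible.
    """
    mean_x, mean_y = x.mean(axis=0), y.mean(axis=0)
    xc, yc = x - mean_x, y - mean_y
    n = x.shape[0]
    cxx = xc.T @ xc / (n - 1) + reg * np.eye(x.shape[1])
    cyy = yc.T @ yc / (n - 1) + reg * np.eye(y.shape[1])
    cxy = xc.T @ yc / (n - 1)
    wx, wy = _inv_sqrt(cxx), _inv_sqrt(cyy)
    # Singular values of the whitened cross-covariance are the
    # canonical correlations, returned in descending order.
    u, rho, vt = np.linalg.svd(wx @ cxy @ wy, full_matrices=False)
    return {"mean_x": mean_x, "mean_y": mean_y,
            "a": wx @ u, "b": wy @ vt.T, "rho": rho}


def canonical_similarity(model, x, y, s):
    """Correlation-weighted cosine over the top-s canonical components.

    Returns an (len(x), len(y)) matrix. This weighting is an assumption,
    not necessarily the score defined in the paper.
    """
    zx = (x - model["mean_x"]) @ model["a"][:, :s]
    zy = (y - model["mean_y"]) @ model["b"][:, :s]
    w = model["rho"][:s]
    num = (zx * w) @ zy.T
    norms = np.linalg.norm(zx, axis=1)[:, None] * np.linalg.norm(zy, axis=1)[None, :]
    return num / (norms + 1e-12)


if __name__ == "__main__":
    # Synthetic stand-ins for image and text features sharing a latent signal.
    rng = np.random.default_rng(0)
    latent = rng.normal(size=(500, 8))
    img = latent @ rng.normal(size=(8, 16)) + 0.1 * rng.normal(size=(500, 16))
    txt = latent @ rng.normal(size=(8, 12)) + 0.1 * rng.normal(size=(500, 12))

    model = fit_cca(img, txt)
    sim = canonical_similarity(model, img[:50], txt[:50], s=8)
    # Retrieval-style check: does each image rank its own text first?
    top1 = np.mean(np.argmax(sim, axis=1) == np.arange(50))
    print("top-1 retrieval accuracy on synthetic pairs:", top1)
```

In this sketch the only fitting step is the eigendecomposition and SVD of feature covariance matrices, which is where the cubic complexity mentioned in the abstract would come from. No gradient-based training is involved.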
Peer Reviews
Decision·ICLR 2025 Poster
S1: (**Data Efficiency and Good Performance**) CSA requires significantly fewer multimodal data pairs by relying on unimodal encoders, which could benefit researchers constrained by data or computational resources. It needs as few as 35,000 image-text pairs to match the performance of CLIP on ImageNet, which is especially notable.
S2: (**Computational Simplicity**) One of the most accessible aspects of CSA is its ability to function effectively without requiring GPU-intensive training, which
W1: (**Hyperparameter Sensitivity and Limited Justification for $s$ Selection**) The choice of the hyperparameter $s$ in the canonical similarity metric (Section 4.2) lacks comprehensive justification. While the authors discuss a trade-off in feature distinguishability based on $s$, a more detailed sensitivity analysis showing how varying $s$ affects downstream performance across tasks would add clarity, since Table 3 indicates that the framework is quite sensitive to $s$. Table 3 provides
- The paper is well motivated: Finding data-efficient ways to train multimodal models using existing pretrained unimodal encoders is an important research direction.
- The method used in the paper appears novel, with only one other related work (ASIF) that uses independent unimodal encoders to project embeddings onto a multimodal representational space.
- The proposed approach (CSA) beats or matches CLIP’s performance in image classification tasks using significantly less data.
- The paper’s e
- In Figure 3, CSA was only trained on in-distribution image-caption pairs. This may lead to an unfair comparison to CLIP, as CSA has seen the ImageNet/Leafy Spurge distribution that it is being tested on during the training process. CLIP’s rise to fame is due to its general zero-shot capabilities. The zero-shot capabilities of CSA are not fully evaluated in this paper.
- A fairer comparison might be to fine-tune CLIP on ImageNet/Leafy Spurge. This could be done by training the la
1. This paper proposes the canonical similarity analysis (CSA) framework, which can replicate the CLIP multimodal model. It uses only two unimodal encoders and demands far less computation and data.
2. This paper provides a theoretical analysis of the trade-off between obtaining informative embeddings and distinguishing multimodal data, considering various hyperparameters of CSA.
3. The extensive experiments on various downstream tasks (such as image classification, cross-modal retri
1. This paper proposes a post-tuning mapping framework on the unimodal features, which solves a matrix optimization without training any encoders. Hence, performance relies heavily on the choice of visual and textual encoders. The experiments cover only one encoder pairing (gtr + dino); analysis of CSA with more encoders is needed.
2. In the performance comparisons, ASIF is the only fair baseline, which is unpersuasive. It is important to add more compara
Taxonomy
Topics: Natural Language Processing Techniques · Speech and dialogue systems
Methods: Contrastive Language-Image Pre-training
