With Limited Data for Multimodal Alignment, Let the STRUCTURE Guide You
Fabian Gröger, Shuo Wen, Huyen Le, Maria Brbić

TL;DR
This paper introduces STRUCTURE, a regularization technique that preserves the neighborhood geometry of pretrained unimodal encoders. Combined with aligning the layers that have the highest representational similarity, it enables effective multimodal alignment with limited paired data and significantly improves zero-shot performance.
Contribution
The paper proposes a novel regularization method, STRUCTURE, for aligning pretrained unimodal models with limited data, and demonstrates its effectiveness across multiple benchmarks.
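The summary does not say which objective aligns the two frozen encoders. A common choice for image-text alignment is a CLIP-style symmetric contrastive (InfoNCE) loss computed on the outputs of small trainable projection heads. The sketch below assumes that setup and shows only the loss. The function name `clip_style_loss` and the temperature of 0.07 are illustrative assumptions, not details taken from the paper.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def log_softmax(row):
    """Numerically stable log-softmax of a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def clip_style_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch where img_emb[i] is paired with txt_emb[i].

    Hypothetical helper (assumed objective, not confirmed by the paper): in
    practice the inputs would be projection-head outputs on frozen encoders.
    """
    imgs = [l2_normalize(r) for r in img_emb]
    txts = [l2_normalize(r) for r in txt_emb]
    n = len(imgs)
    # logits[i][j] = cosine(image i, text j) / temperature
    logits = [[sum(a * b for a, b in zip(im, tx)) / temperature for tx in txts]
              for im in imgs]
    # Image -> text: cross-entropy with the paired text (diagonal) as target.
    i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text -> image: the same computation on the columns of the logit matrix.
    cols = [[logits[r][c] for r in range(n)] for c in range(n)]
    t2i = -sum(log_softmax(cols[i])[i] for i in range(n)) / n
    return 0.5 * (i2t + t2i)


if __name__ == "__main__":
    images = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    matched = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.1], [0.1, 0.0, 1.0]]
    shuffled = [matched[1], matched[2], matched[0]]
    # Correct pairings should give a lower loss than mismatched ones.
    print(clip_style_loss(images, matched) < clip_style_loss(images, shuffled))  # True
```

With limited paired data, this loss sees few examples, which is the regime the paper's regularization and layer-selection findings target.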
Findings
High-quality alignment achievable with less than 1% of typical data
Aligning layers with highest representational similarity improves performance
Substantial gains in zero-shot classification and retrieval benchmarks
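To make the layer-selection finding concrete, here is a minimal sketch that scores every (vision layer, text layer) pair of features computed on the same paired samples and returns the most similar pair. It uses linear CKA purely as a familiar stand-in for "representational similarity". The summary does not say which metric or which layer pairs the paper searches over, and `most_similar_layer_pair` is a hypothetical helper.

```python
import math


def center_columns(X):
    """Subtract each feature's mean over the samples (rows)."""
    n = len(X)
    means = [sum(row[k] for row in X) / n for k in range(len(X[0]))]
    return [[row[k] - means[k] for k in range(len(row))] for row in X]


def gram(X):
    """Linear-kernel Gram matrix: entry (i, j) is <x_i, x_j>."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in X] for r1 in X]


def linear_cka(X, Y):
    """Linear CKA between two representations of the same n samples.

    X is n x d1 and Y is n x d2 (feature sizes may differ). Returns a value in
    [0, 1]; 1 means the centered Gram matrices are proportional. Assumes
    neither representation is constant across samples.
    """
    K = gram(center_columns(X))
    L = gram(center_columns(Y))
    hsic_xy = sum(k * l for rk, rl in zip(K, L) for k, l in zip(rk, rl))
    hsic_xx = sum(k * k for rk in K for k in rk)
    hsic_yy = sum(l * l for rl in L for l in rl)
    return hsic_xy / math.sqrt(hsic_xx * hsic_yy)


def most_similar_layer_pair(vision_layers, text_layers):
    """Hypothetical layer search: return (score, vision_idx, text_idx) for the
    pair of layers whose features are most similar under linear CKA."""
    best = None
    for i, X in enumerate(vision_layers):
        for j, Y in enumerate(text_layers):
            score = linear_cka(X, Y)
            if best is None or score > best[0]:
                best = (score, i, j)
    return best
```

Each list entry in `vision_layers` or `text_layers` is one layer's features for the same batch of paired samples, so the search costs one similarity evaluation per layer pair and needs no training.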
Abstract
Multimodal models have demonstrated powerful capabilities in complex tasks requiring multimodal alignment, including zero-shot classification and cross-modal retrieval. However, existing models typically rely on millions of paired multimodal samples, which are prohibitively expensive or infeasible to obtain in many domains. In this work, we explore the feasibility of building multimodal models with a limited amount of paired data by aligning pretrained unimodal foundation models. We show that high-quality alignment is possible with as few as tens of thousands of paired samples (less than 1% of the data typically used in the field). To achieve this, we introduce STRUCTURE, an effective regularization technique that preserves the neighborhood geometry of the latent space of unimodal encoders. Additionally, we show that aligning last layers is often suboptimal and…
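The abstract describes STRUCTURE only as a regularizer that preserves the neighborhood geometry of the unimodal latent spaces. One plausible way to express that idea, assumed here and not taken from the paper, is to turn each sample's similarities to the rest of the batch into a softmax distribution, once in the frozen encoder's space and once in the aligned space, and then penalize the Jensen-Shannon divergence between the two. `structure_penalty` and the temperature value are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def neighbor_distributions(X, temperature=0.1):
    """Row i: softmax over cosine similarities from sample i to every other sample."""
    Xn = [l2_normalize(r) for r in X]
    n = len(Xn)
    dists = []
    for i in range(n):
        # Exclude self-similarity so each row describes sample i's neighbors.
        logits = [sum(a * b for a, b in zip(Xn[i], Xn[j])) / temperature
                  for j in range(n) if j != i]
        m = max(logits)
        exps = [math.exp(z - m) for z in logits]
        total = sum(exps)
        dists.append([e / total for e in exps])
    return dists


def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log), bounded above by ln 2."""
    mid = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)

    return 0.5 * kl(p, mid) + 0.5 * kl(q, mid)


def structure_penalty(pretrained, aligned, temperature=0.1):
    """Hypothetical neighborhood-preservation term (not the paper's exact loss):
    mean JS divergence between per-sample neighbor distributions computed in
    the frozen pretrained space and in the aligned space."""
    P = neighbor_distributions(pretrained, temperature)
    Q = neighbor_distributions(aligned, temperature)
    return sum(js_divergence(p, q) for p, q in zip(P, Q)) / len(P)
```

Because only cosine similarities enter the distributions, rescaling the aligned features leaves the penalty unchanged, while reshuffling which samples are neighbors increases it. In training, a term like this would presumably be computed per modality (frozen vs. projected image features, and likewise for text) and added with some weight to the alignment loss; that weighting is also an assumption.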
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis · Face recognition and analysis
