SMA: Submodular Modality Aligner For Data Efficient Multimodal Learning
Truong Pham, Anay Majee, Rishabh Iyer

TL;DR
This paper introduces SMA, a set-based, submodular approach to multimodal alignment that improves data efficiency and generalization in low-data scenarios by leveraging multiple descriptions and augmentations.
Contribution
The paper proposes a novel set-based, submodular formulation for multimodal alignment, moving beyond pairwise methods to better utilize limited data and capture richer cross-modal structure.
Findings
SMA outperforms existing methods on 14 zero-shot tasks in low-data regimes.
SMA achieves strong multimodal generalization with only tens of thousands of samples.
Set-based, submodular objectives enhance data efficiency in multimodal learning.
Abstract
Despite the recent success of Multimodal Foundation Models (FMs), their reliance on massive paired datasets limits their applicability in low-data and rare-scenario settings where aligned data is scarce and expensive. A key bottleneck is the adoption of an instance-level formulation, which learns alignment by maximizing correlation between individual image-text pairs while neglecting the underlying geometric structure across modalities, resulting in a modality gap. In this paper, we propose a combinatorial paradigm for multimodal alignment that moves beyond pairwise learning and introduce the Submodular Modality Aligner (SMA), which treats multiple augmentations and descriptions of an entity as a set to capture richer cross-modal structure. We instantiate SMA using a principled objective based on Submodular…
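To make the term "submodular" concrete, here is a minimal sketch of a set function with diminishing returns, evaluated over a toy set of embeddings. It is not the SMA objective, which is truncated in the abstract above. The facility-location function, the rescaled cosine similarity, the 2-D vectors, and every function name below are assumptions chosen only for illustration.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity(u, v):
    """Cosine similarity rescaled to [0, 1] so that it is non-negative."""
    return (1.0 + cosine(u, v)) / 2.0


def facility_location(ground, subset):
    """f(A) = sum over i in ground of max over j in A of s(i, j); f(empty) = 0."""
    if not subset:
        return 0.0
    return sum(
        max(similarity(ground[i], ground[j]) for j in subset)
        for i in range(len(ground))
    )


def marginal_gain(ground, subset, k):
    """Gain from adding element k to subset."""
    return facility_location(ground, subset | {k}) - facility_location(ground, subset)


def is_submodular(ground, tol=1e-9):
    """Brute-force check: for all A subset of B and k not in B, gain(A, k) >= gain(B, k)."""
    indices = range(len(ground))
    for size_b in range(len(ground) + 1):
        for b in combinations(indices, size_b):
            big = frozenset(b)
            for size_a in range(len(big) + 1):
                for a in combinations(sorted(big), size_a):
                    small = frozenset(a)
                    for k in indices:
                        if k in big:
                            continue
                        if marginal_gain(ground, small, k) < marginal_gain(ground, big, k) - tol:
                            return False
    return True


# Hypothetical toy embeddings: three "image views" and two "captions" of one entity.
ground = [
    [1.0, 0.0], [0.9, 0.1], [0.8, 0.3],
    [0.7, 0.6], [0.2, 1.0],
]

print(is_submodular(ground))
print(facility_location(ground, frozenset({0})), facility_location(ground, frozenset({0, 4})))
```

The brute-force check enumerates every nested pair of subsets, so it is only practical for a handful of elements. It is meant to show the diminishing-returns property that any submodular objective shares, not to serve as a training loss.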
