SMA: Submodular Modality Aligner For Data Efficient Multimodal Learning

Truong Pham; Anay Majee; Rishabh Iyer

arXiv:2605.12872·cs.LG·May 14, 2026

TL;DR

This paper introduces SMA, a set-based, submodular approach to multimodal alignment that improves data efficiency and generalization in low-data scenarios by leveraging multiple descriptions and augmentations.

Contribution

The paper proposes a novel set-based, submodular formulation for multimodal alignment, moving beyond pairwise methods to better utilize limited data and capture richer cross-modal structure.

Findings

01. SMA outperforms existing methods on 14 zero-shot tasks in low-data regimes.

02. SMA achieves strong multimodal generalization with only tens of thousands of samples.

03. Set-based, submodular objectives enhance data efficiency in multimodal learning.

Abstract

Despite the recent success of Multimodal Foundation Models (FMs), their reliance on massive paired datasets limits their applicability in low-data and rare-scenario settings where aligned data is scarce and expensive. A key bottleneck is the adoption of an instance-level formulation, which learns alignment by maximizing correlation between individual image-text pairs while neglecting the underlying geometric structure across modalities, resulting in a modality gap. In this paper, we propose a combinatorial paradigm for multimodal alignment that moves beyond pairwise learning and introduce the Submodular Modality Aligner (SMA), which treats multiple augmentations and descriptions of an entity as a set to capture richer cross-modal structure. We instantiate SMA using a principled objective based on Submodular…
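
To make the term "submodular" concrete, here is a minimal sketch of the diminishing-returns property on a set of embeddings. It uses the facility-location function, a standard textbook example of a submodular function. This is an illustration only: the abstract is truncated before it names SMA's actual objective, so nothing here should be read as the paper's method. The toy embeddings and all function names are hypothetical.

```python
def cosine(u, v):
    """Cosine similarity between two plain-Python vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v)


def facility_location(selected, ground, sim):
    """f(S) = sum over ground items of the max similarity to any member of S.

    With non-negative similarities this function is monotone and submodular.
    """
    if not selected:
        return 0.0
    return sum(max(sim[i][j] for j in selected) for i in ground)


def marginal_gain(item, selected, ground, sim):
    """Increase in f when `item` is added to `selected`."""
    with_item = facility_location(selected | {item}, ground, sim)
    return with_item - facility_location(selected, ground, sim)


# Hypothetical toy embeddings: items 0 and 1 are near-duplicates, as are 2 and 3.
embeddings = [
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [0.1, 0.9],
]
ground = range(len(embeddings))

# Clip at zero so every similarity is non-negative (an assumption of this sketch).
sim = [[max(cosine(u, v), 0.0) for v in embeddings] for u in embeddings]

small = {0}      # a subset ...
large = {0, 2}   # ... and a superset of it

# Diminishing returns: adding item 1 helps the smaller set at least as much
# as it helps the larger one.
gain_small = marginal_gain(1, small, ground, sim)
gain_large = marginal_gain(1, large, ground, sim)
print(gain_small >= gain_large)
```

The point of the comparison is that a set function with this property rewards covering new regions of the embedding space more than adding redundant members, which is one intuition for why set-level objectives can extract more from a small number of samples than purely pairwise ones.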
