Cross-Modal-Domain Generalization Through Semantically Aligned Discrete Representations
Souptik Sen, Raneen Younis, Zahra Ahmadi

TL;DR
CoDAAR introduces a novel discrete multimodal learning framework that aligns semantic representations across modalities, enabling robust cross-modal and cross-domain generalization while preserving modality-specific structures.
Contribution
The paper proposes a unified discrete space in which modality-specific codebooks are aligned at the index level, combining temporal and semantic alignment mechanisms to improve multimodal generalization.
Findings
Achieves state-of-the-art results on multiple cross-modal benchmarks.
Effectively preserves modality-specific structures while enabling generalization.
Demonstrates robustness across diverse multimodal tasks.
Abstract
Multimodal learning seeks to integrate information across diverse sensory sources, yet current approaches struggle to balance cross-modal generalizability with modality-specific structure. Continuous (implicit) methods preserve fine-grained priors but render generalization challenging, while discrete (explicit) approaches enforce shared prototypes at the expense of modality specificity. We introduce CoDAAR (Cross-modal Discrete Alignment And Reconstruction), a novel framework that resolves this long-standing trade-off by establishing semantic consensus across modality-specific codebooks through index-level alignment. This design uniquely allows CoDAAR to preserve modality-unique structures while achieving generalizable cross-modal representations within a unified discrete space. CoDAAR combines two complementary mechanisms: Discrete Temporal Alignment (DTA), which enables fine-grained…
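To make "index-level alignment" across modality-specific codebooks more concrete, here is a minimal toy sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the function names (`nearest_code`, `quantize`, `index_agreement`), the nearest-neighbour quantization rule, the toy 2-D codebooks, and the agreement metric are all hypothetical and do not come from the paper. The idea shown is only that each modality keeps its own codebook vectors, while paired inputs can be compared by whether they select the same codebook index.

```python
from typing import List, Sequence

Vector = Sequence[float]


def nearest_code(x: Vector, codebook: Sequence[Vector]) -> int:
    """Return the index of the codebook entry closest to x (squared Euclidean)."""
    def sq_dist(a: Vector, b: Vector) -> float:
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    return min(range(len(codebook)), key=lambda k: sq_dist(x, codebook[k]))


def quantize(features: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Map each feature vector to the index of its nearest code."""
    return [nearest_code(x, codebook) for x in features]


def index_agreement(idx_a: Sequence[int], idx_b: Sequence[int]) -> float:
    """Fraction of paired items whose two modalities selected the same index."""
    matches = sum(1 for a, b in zip(idx_a, idx_b) if a == b)
    return matches / len(idx_a)


# Hypothetical modality-specific codebooks: the vectors live in different
# regions of feature space, but index k is meant to denote the same concept.
audio_codebook = [(0.0, 0.0), (1.0, 1.0), (4.0, 0.0)]
video_codebook = [(10.0, 10.0), (12.0, 12.0), (20.0, 10.0)]

# Hypothetical paired features (item i in audio corresponds to item i in video).
audio_features = [(0.1, -0.2), (0.9, 1.2), (3.8, 0.1)]
video_features = [(10.2, 9.9), (11.8, 12.1), (12.3, 11.9)]

audio_idx = quantize(audio_features, audio_codebook)
video_idx = quantize(video_features, video_codebook)
agreement = index_agreement(audio_idx, video_idx)

print(audio_idx, video_idx, agreement)
```

In this toy setup the third pair disagrees on its index, which is the kind of mismatch an alignment objective would presumably penalize during training; how CoDAAR actually enforces consensus is described in the paper itself, not here.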
