Cross-Modal-Domain Generalization Through Semantically Aligned Discrete Representations

Souptik Sen; Raneen Younis; Zahra Ahmadi

arXiv:2605.12145 · cs.CV · May 14, 2026

TL;DR

CoDAAR introduces a novel discrete multimodal learning framework that aligns semantic representations across modalities, enabling robust cross-modal and cross-domain generalization while preserving modality-specific structures.

Contribution

CoDAAR proposes a unified discrete space in which modality-specific codebooks are aligned at the index level, combining temporal and semantic alignment mechanisms to improve multimodal generalization.
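To make the idea of index-level alignment more concrete, here is a minimal toy sketch in Python. It is not the authors' implementation: the names `quantize`, `index_agreement`, `codebook_audio`, and `codebook_video`, the choice of audio and video as modalities, the feature dimensions, and the nearest-neighbor quantizer are all assumptions made for illustration. The sketch only shows what it could mean for two separate codebooks to agree on the same index for paired inputs while keeping their own vector spaces.

```python
import numpy as np

def quantize(features, codebook):
    """Return the index of the nearest codebook entry for each feature row."""
    # Squared Euclidean distance between every feature and every code: shape (N, K)
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

def index_agreement(idx_a, idx_b):
    """Fraction of paired samples assigned the same code index in both modalities."""
    return float(np.mean(idx_a == idx_b))

rng = np.random.default_rng(0)
K = 4  # number of discrete codes per codebook (assumed)

# Hypothetical modality-specific codebooks: same number of codes,
# but different dimensionality and scale per modality.
codebook_audio = np.eye(K, 8)
codebook_video = 2.0 * np.eye(K, 16)

# Toy paired data: each audio/video pair shares one latent concept id,
# and each feature is a lightly perturbed copy of that modality's code.
concepts = rng.integers(0, K, size=32)
audio_feats = codebook_audio[concepts] + 0.01 * rng.normal(size=(32, 8))
video_feats = codebook_video[concepts] + 0.01 * rng.normal(size=(32, 16))

idx_audio = quantize(audio_feats, codebook_audio)
idx_video = quantize(video_feats, codebook_video)
agreement = index_agreement(idx_audio, idx_video)
print(f"index agreement across modalities: {agreement:.2f}")
```

In this toy setup the two codebooks never share vectors, only index positions, so each modality keeps its own geometry. A trained system would have to learn codebooks for which paired inputs land on matching indices; here that property is built into the synthetic data by construction.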

Findings

01. Achieves state-of-the-art results on multiple cross-modal benchmarks.

02. Effectively preserves modality-specific structures while enabling generalization.

03. Demonstrates robustness across diverse multimodal tasks.

Abstract

Multimodal learning seeks to integrate information across diverse sensory sources, yet current approaches struggle to balance cross-modal generalizability with modality-specific structure. Continuous (implicit) methods preserve fine-grained priors but render generalization challenging, while discrete (explicit) approaches enforce shared prototypes at the expense of modality specificity. We introduce CoDAAR (Cross-modal Discrete Alignment And Reconstruction), a novel framework that resolves this long-standing trade-off by establishing semantic consensus across modality-specific codebooks through index-level alignment. This design uniquely allows CoDAAR to preserve modality-unique structures while achieving generalizable cross-modal representations within a unified discrete space. CoDAAR combines two complementary mechanisms: Discrete Temporal Alignment (DTA), which enables fine-grained…
