TL;DR
CrossFlowDG applies noise-free, cross-modal flow matching to reduce the modality gap between image and text embeddings in domain generalization, improving robustness across diverse visual domains.
Contribution
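The paper proposes a framework that explicitly transports domain-biased image embeddings toward the domain-invariant text embedding of the correct class in a joint Euclidean latent space, addressing the modality gap that cosine-similarity contrastive alignment leaves behind.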
Findings
Achieves competitive performance on four DG benchmarks.
State-of-the-art results on TerraIncognita.
Effective cross-modal flow matching improves domain invariance.
Abstract
Domain generalization (DG) aims to maintain performance under domain shift, which in computer vision appears primarily as stylistic variations that cause models to overfit to domain-specific appearance cues rather than class semantics. To overcome this, recent methods use textual representations as stable, domain-invariant anchors. However, multimodal approaches that rely on cosine similarity-based contrastive alignment leave a modality gap where image and text embeddings remain geometrically separated despite semantic correspondence. We propose CrossFlowDG, a novel DG framework that addresses this residual gap using noise-free, cross-modal flow matching. By learning a continuous transformation in the joint Euclidean latent space, our framework explicitly transports domain-biased image embeddings toward domain-invariant text embeddings of the correct class. Using the efficient VMamba…
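To make the transport idea in the abstract concrete, the following is a minimal sketch of flow matching between two embeddings. It assumes a straight-line (linear interpolation) path and a squared-error velocity loss, which is a common flow matching formulation and not necessarily the one CrossFlowDG uses. "Noise-free" is read here as starting the flow from the image embedding itself rather than from Gaussian noise. All function names, the toy vectors, and the oracle velocity field (a stand-in for a trained network) are hypothetical.

```python
import random


def interpolate(x0, x1, t):
    """Point on the straight path from x0 (image) to x1 (text) at time t in [0, 1]."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Velocity of the straight path, constant in t: x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(velocity_fn, x0, x1, t):
    """Mean squared error between predicted and target velocity at x_t."""
    x_t = interpolate(x0, x1, t)
    pred = velocity_fn(x_t, t)
    target = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)


def euler_transport(velocity_fn, x0, steps=10):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy 3-dimensional embeddings (hypothetical values, not from the paper).
image_emb = [0.9, 0.1, -0.3]   # domain-biased image embedding (source)
text_emb = [0.2, 0.7, 0.4]     # text embedding of the correct class (target)


def oracle_velocity(x_t, t):
    """Assumed stand-in for a trained velocity network on this single pair."""
    return target_velocity(image_emb, text_emb)


loss = flow_matching_loss(oracle_velocity, image_emb, text_emb, t=random.random())
transported = euler_transport(oracle_velocity, image_emb, steps=10)
```

In a real training loop the oracle would be replaced by a learned network, the loss would be averaged over sampled image-text pairs and random times t, and the Euler integration would be applied only at inference to move an image embedding toward the text side of the joint space.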
