Open-set Cross Modal Generalization via Multimodal Unified Representation

Hai Huang; Yan Xia; Shulei Wang; Hanting Wang; Minghui Fang; Shengpeng Ji; Sashuai Zhou; Tao Jin; Zhou Zhao

arXiv:2507.14935 · cs.CV · July 22, 2025


TL;DR

This paper introduces Open-set Cross Modal Generalization (OSCMG), a task that evaluates whether multimodal unified representations generalize to unseen classes in open-set environments. To address it, the paper proposes MICU, which pairs a masked contrastive objective (FCMI) with a self-supervised jigsaw task (CUJP).

Contribution

The paper extends cross-modal generalization to open-set scenarios and proposes MICU, a multimodal framework that combines Fine-Coarse Masked multimodal InfoNCE (FCMI) and Cross modal Unified Jigsaw Puzzles (CUJP) to improve open-set generalization.

Findings

01. MICU outperforms existing methods on CMG and OSCMG benchmarks.

02. The proposed techniques improve robustness to unseen classes.

03. Extensive experiments validate the effectiveness of MICU.

Abstract

This paper extends Cross Modal Generalization (CMG) to open-set environments by proposing the more challenging Open-set Cross Modal Generalization (OSCMG) task. This task evaluates multimodal unified representations in open-set conditions, addressing the limitations of prior closed-set cross-modal evaluations. OSCMG requires not only cross-modal knowledge transfer but also robust generalization to unseen classes within new modalities, a scenario frequently encountered in real-world applications. Existing multimodal unified representation work lacks consideration for open-set environments. To tackle this, we propose MICU, comprising two key components: Fine-Coarse Masked multimodal InfoNCE (FCMI) and Cross modal Unified Jigsaw Puzzles (CUJP). FCMI enhances multimodal alignment by applying contrastive learning at both holistic semantic and temporal levels, incorporating masking to enhance…
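
The idea of contrasting two modalities at both a holistic and a temporal level can be illustrated with a small sketch. This is not the authors' FCMI implementation: the function names (`info_nce`, `fine_coarse_loss`), the mean-pooling step, the per-feature random masking, and the equal weighting of the two terms are all assumptions made for illustration. It only shows how a coarse (sample-level) and a fine (time-step-level) InfoNCE term might be combined over paired features shaped `[batch][time][dim]`.

```python
import math
import random

def _normalize(v):
    # L2-normalize; fall back to 1.0 so an all-zero (fully masked) vector stays zero.
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def info_nce(a, b, temperature=0.1):
    """One-directional InfoNCE: a[i] should match b[i] against every other b[j]."""
    a = [_normalize(v) for v in a]
    b = [_normalize(v) for v in b]
    loss = 0.0
    for i, ai in enumerate(a):
        logits = [sum(x * y for x, y in zip(ai, bj)) / temperature for bj in b]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_denom - logits[i]
    return loss / len(a)

def random_mask(seq, drop_prob, rng):
    """Zero each feature with probability drop_prob (hypothetical masking scheme)."""
    return [[0.0 if rng.random() < drop_prob else x for x in step] for step in seq]

def mean_pool(seq):
    """Average a [time][dim] sequence over time into a single [dim] vector."""
    return [sum(step[d] for step in seq) / len(seq) for d in range(len(seq[0]))]

def fine_coarse_loss(mod_a, mod_b, drop_prob=0.0, temperature=0.1, seed=0):
    """mod_a, mod_b: paired batches of two modalities, shaped [batch][time][dim]."""
    rng = random.Random(seed)
    mod_a = [random_mask(s, drop_prob, rng) for s in mod_a]
    mod_b = [random_mask(s, drop_prob, rng) for s in mod_b]
    # Coarse term: one pooled vector per sample; negatives are the other samples.
    coarse = info_nce([mean_pool(s) for s in mod_a],
                      [mean_pool(s) for s in mod_b], temperature)
    # Fine term: within each pair, time step t of A should match time step t of B.
    fine = sum(info_nce(sa, sb, temperature)
               for sa, sb in zip(mod_a, mod_b)) / len(mod_a)
    return coarse + fine  # equal weighting is an assumption

# Toy paired features: two samples, two time steps, four dimensions.
sample0 = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
sample1 = [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
aligned = fine_coarse_loss([sample0, sample1], [sample0, sample1])
misaligned = fine_coarse_loss([sample0, sample1], [sample1, sample0])
```

In this toy setup the correctly paired batch yields a much lower loss than the swapped one, which is the behavior a contrastive alignment objective is meant to reward. A real system would compute these terms on learned encoder outputs with an autodiff framework rather than on plain lists.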
