EmergentBridge: Improving Zero-Shot Cross-Modal Transfer in Unified Multimodal Embedding Models

Jincheng Xie; Xingchen Xiao; Runheng Liu; Zhongyi Huang; Yu Zheng; Heyan Huang

arXiv:2604.11043·cs.AI·May 19, 2026

TL;DR

EmergentBridge is an embedding-level bridging framework that improves zero-shot transfer between unpaired modalities in unified multimodal embedding models, without requiring exhaustive pairwise supervision.

Contribution

It introduces a proxy alignment method that preserves existing anchor structures while improving connectivity across unpaired modalities.

Findings

01. Consistently outperforms prior methods on nine diverse datasets.

02. Improves zero-shot classification and retrieval performance.

03. Demonstrates strong emergent alignment across modalities.

Abstract

Unified multimodal embedding spaces underpin practical applications such as cross-modal retrieval and zero-shot recognition. In many real deployments, however, supervision is available only for a small subset of modality pairs (e.g., image-text), leaving unpaired modality pairs (e.g., audio ↔ depth, infrared ↔ audio) weakly connected and thus performing poorly on zero-shot transfer. Addressing this sparse-pairing regime is therefore essential for scaling unified embedding systems to new tasks without curating exhaustive pairwise data. We propose EmergentBridge, an embedding-level bridging framework that improves performance on these unpaired pairs without requiring exhaustive pairwise supervision. Our key observation is that naively aligning a new modality to a synthesized proxy embedding can introduce gradient interference,…
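To make the evaluation setting concrete, the following is a minimal sketch of how zero-shot retrieval between two unpaired modalities is typically scored in a shared embedding space: embeddings are L2-normalized and ranked by cosine similarity. This is not the EmergentBridge method, which the abstract does not fully specify here. The function names (`l2_normalize`, `cosine_similarity`, `retrieve`) and the toy three-dimensional "audio" and "depth" vectors are hypothetical and chosen only for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes a non-zero vector)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity between two vectors of equal length."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def retrieve(query, gallery):
    """Return gallery indices sorted by descending similarity to the query."""
    scores = [cosine_similarity(query, g) for g in gallery]
    return sorted(range(len(gallery)), key=lambda i: scores[i], reverse=True)


# Made-up embeddings standing in for encoder outputs (illustrative values only).
audio_query = [0.9, 0.1, 0.0]
depth_gallery = [
    [0.0, 1.0, 0.0],
    [1.0, 0.2, 0.0],
    [0.0, 0.0, 1.0],
]

ranking = retrieve(audio_query, depth_gallery)
print(ranking)  # best-matching gallery index comes first
```

In this setting, neither encoder was trained on audio-depth pairs; retrieval quality depends entirely on how well both modalities were aligned to a shared anchor such as text or images, which is the weak link the abstract describes.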
