HBridge: H-Shape Bridging of Heterogeneous Experts for Unified Multimodal Understanding and Generation

Xiang Wang; Zhifei Zhang; He Zhang; Zhe Lin; Yuqian Zhou; Qing Liu; Shiwei Zhang; Yijun Li; Shaoteng Liu; Haitian Zheng; Jason Kuen; Yuehuan Wang; Changxin Gao; Nong Sang

arXiv:2511.20520·cs.CV·November 26, 2025

Open Access

TL;DR

HBridge introduces an asymmetric H-shaped architecture for multimodal models, selectively bridging layers to improve efficiency and generation quality by leveraging pretrained priors and semantic reconstruction tokens.

Contribution

The paper proposes HBridge, a novel asymmetric bridging architecture that enhances multimodal understanding and generation by selectively connecting layers and incorporating semantic tokens.

Findings

01. HBridge reduces attention sharing by over 40%, improving efficiency.

02. HBridge outperforms prior symmetric fusion models on multiple benchmarks.

03. Selective layer bridging enhances semantic alignment and generation quality.

Abstract

Recent unified models integrate understanding experts (e.g., LLMs) with generative experts (e.g., diffusion models), achieving strong multimodal performance. However, recent advanced methods such as BAGEL and LMFusion follow the Mixture-of-Transformers (MoT) paradigm and adopt a symmetric design that mirrors one expert onto the other for convenient initialization and fusion. This design remains suboptimal because of inherent modality discrepancies. In this work, we propose HBridge, an asymmetric H-shaped architecture that enables heterogeneous experts to optimally leverage pretrained priors from their respective modality domains. Unlike prior dense fusion strategies that straightforwardly connect all layers between experts via shared attention, HBridge selectively bridges intermediate layers, reducing attention sharing by over 40%, which improves efficiency and enhances generation quality. Shallow and…
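
As a rough illustration of what selective bridging means for the amount of attention sharing, the following minimal Python sketch marks which layers of a two-expert stack share attention and computes the fraction of layers left unbridged. The layer count (28), the bridged range (layers 6 to 21), and the function names are hypothetical choices made for illustration; they are not taken from the paper, and the sketch does not implement the HBridge model itself.

```python
def bridged_layers(num_layers, first_bridged, last_bridged):
    """Return one boolean per layer: True where the two experts share attention.

    Hypothetical helper: only the contiguous intermediate range
    [first_bridged, last_bridged] is bridged; shallow and deep layers are not.
    """
    if not (0 <= first_bridged <= last_bridged < num_layers):
        raise ValueError("bridged range must lie inside the layer stack")
    return [first_bridged <= i <= last_bridged for i in range(num_layers)]


def sharing_reduction(mask):
    """Fraction of layers that no longer share attention, relative to
    a dense design in which every layer is bridged."""
    return 1.0 - sum(mask) / len(mask)


# Assumed configuration for illustration only: 28 layers, bridge layers 6..21.
mask = bridged_layers(num_layers=28, first_bridged=6, last_bridged=21)
reduction = sharing_reduction(mask)

# 16 of 28 layers are bridged, so 12 of 28 are not.
print(f"{reduction:.1%}")  # prints 42.9%
```

With these assumed numbers, bridging only the middle 16 of 28 layers removes a little over 40% of the shared-attention layers, which is the kind of saving the abstract describes.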

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning