Symbiotic-MoE: Unlocking the Synergy between Generation and Understanding
Xiangyue Liu, Zijian Zhang, Miles Yang, Zhao Zhong, Liefeng Bo, Ping Tan

TL;DR
Symbiotic-MoE introduces a unified multimodal framework that enhances both image generation and understanding tasks by disentangling experts and employing progressive training, leading to improved synergy and performance.
Contribution
The paper proposes modality-aware expert disentanglement and a progressive training strategy within an MoE architecture, resolving task interference without adding parameters.
Findings
Achieves rapid generative convergence and improved understanding performance.
Boosts scores on MMLU and OCRBench benchmarks.
Demonstrates effective cross-modal synergy in large multimodal models.
Abstract
Empowering Large Multimodal Models (LMMs) with image generation often leads to catastrophic forgetting in understanding tasks due to severe gradient conflicts. While existing paradigms like Mixture-of-Transformers (MoT) mitigate this conflict through structural isolation, they fundamentally sever cross-modal synergy and suffer from capacity fragmentation. In this work, we present Symbiotic-MoE, a unified pre-training framework that resolves task interference within a native multimodal Mixture-of-Experts (MoE) Transformer architecture with zero-parameter overhead. We first identify that standard MoE tuning leads to routing collapse, where generative gradients dominate expert utilization. To address this, we introduce Modality-Aware Expert Disentanglement, which partitions experts into task-specific groups while utilizing shared experts as a multimodal semantic bridge. Crucially, this…
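The expert partitioning described in the abstract can be illustrated with a small routing sketch. This is not the authors' implementation: the abstract states only that experts are split into task-specific groups and that shared experts bridge the modalities. The group sizes, the top-k value, the softmax renormalization, the always-active shared expert, and every name below (`GROUPS`, `route`, `moe_output`) are assumptions made for illustration, and tokens are reduced to scalars to keep the example short.

```python
import math

# Hypothetical layout: 8 routed experts split into two task groups.
# The real group sizes are not given in the abstract.
UNDERSTANDING_EXPERTS = [0, 1, 2, 3]
GENERATION_EXPERTS = [4, 5, 6, 7]
GROUPS = {
    "understanding": UNDERSTANDING_EXPERTS,
    "generation": GENERATION_EXPERTS,
}


def route(router_logits, task, top_k=2):
    """Return {expert_index: weight} for one token.

    Only experts in the token's task group compete for the top-k slots,
    so logits of the other group cannot claim the token.
    """
    allowed = GROUPS[task]
    ranked = sorted(allowed, key=lambda e: router_logits[e], reverse=True)
    selected = ranked[:top_k]
    # Softmax over the selected logits only, so the weights sum to 1.
    peak = max(router_logits[e] for e in selected)
    exps = {e: math.exp(router_logits[e] - peak) for e in selected}
    total = sum(exps.values())
    return {e: v / total for e, v in exps.items()}


def moe_output(token, router_logits, task, routed_experts, shared_expert, top_k=2):
    """Combine the shared expert (assumed always active) with the routed experts."""
    weights = route(router_logits, task, top_k)
    out = shared_expert(token)
    for e, w in weights.items():
        out += w * routed_experts[e](token)
    return out
```

In this sketch, an understanding token whose router logits happen to favour the generation experts is still dispatched inside its own group, which is one simple way to keep one task from monopolizing expert utilization. Because the shared expert sees tokens from both tasks, it is the only path through which the two tasks share parameters in this toy version.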
