Mixture of States: Routing Token-Level Dynamics for Multimodal Generation
Haozhe Liu, Ding Liu, Mingchen Zhuge, Zijian Zhou, Tian Xie, Sen He, Yukang Yang, Shuming Liu, Yuren Cong, Jiadong Guo, Hongyu Xu, Ke Xu, Kam-Woh Ng, Juan C. Pérez, Juan-Manuel Pérez-Rúa, Tao Xiang, Wei Liu, Shikun Liu, Jürgen Schmidhuber

TL;DR
MoS introduces a token-wise routing mechanism for multimodal diffusion models, enabling efficient and flexible fusion of modalities that achieves state-of-the-art results with fewer parameters.
Contribution
The paper presents MoS, a novel token-level fusion paradigm with a learnable router for multimodal diffusion models, improving efficiency and performance.
Findings
Achieves state-of-the-art results in text-to-image generation and editing.
Models with 3B-5B parameters match or surpass larger counterparts.
Efficient routing with minimal computational overhead.
Abstract
We introduce MoS (Mixture of States), a novel fusion paradigm for multimodal diffusion models that merges modalities using flexible, state-based interactions. The core of MoS is a learnable, token-wise router that creates denoising timestep- and input-dependent interactions between modalities' hidden states, precisely aligning token-level features with the diffusion trajectory. This router sparsely selects the top-k hidden states and is trained with an ε-greedy strategy, efficiently selecting contextual features with minimal learnable parameters and negligible computational overhead. We validate our design with text-to-image generation (MoS-Image) and editing (MoS-Editing), which achieve state-of-the-art results. With only 3B to 5B parameters, our models match or surpass larger counterparts. These findings establish MoS as a flexible and compute-efficient…
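The routing step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `route_states`, the tensor shapes, the softmax weighting over the selected states, and the per-token uniform exploration rule are all assumptions made for illustration. In the paper the router scores are learned and depend on the denoising timestep and the input; here they are simply passed in as an array.

```python
import numpy as np

def route_states(logits, states, k, epsilon=0.0, rng=None):
    """Mix k of L candidate hidden states per token (illustrative only).

    logits: (T, L) router scores, one row per token (hypothetical router output).
    states: (L, T, D) candidate hidden states, e.g. one per source layer.
    Returns (mixed, chosen): mixed is (T, D), chosen is (T, k) indices.
    """
    rng = np.random.default_rng() if rng is None else rng
    T, L = logits.shape
    # Greedy choice: indices of the k largest logits for every token.
    chosen = np.argsort(-logits, axis=1)[:, :k]
    # Assumed epsilon-greedy rule: with probability epsilon a token ignores
    # its logits and picks k distinct states uniformly at random.
    explore = rng.random(T) < epsilon
    for t in np.flatnonzero(explore):
        chosen[t] = rng.choice(L, size=k, replace=False)
    # Softmax over the selected logits only, so weights sum to 1 per token.
    picked = np.take_along_axis(logits, chosen, axis=1)
    weights = np.exp(picked - picked.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    # Gather the selected states per token and take the weighted sum.
    token_idx = np.arange(T)[:, None]        # (T, 1), broadcasts against chosen
    gathered = states[chosen, token_idx]     # (T, k, D)
    mixed = (weights[:, :, None] * gathered).sum(axis=1)
    return mixed, chosen

# Usage with random placeholder data: 4 tokens, 6 candidate states, width 8.
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 6))
states = rng.normal(size=(6, 4, 8))
mixed, chosen = route_states(logits, states, k=2, epsilon=0.1, rng=rng)
print(mixed.shape, chosen.shape)  # (4, 8) (4, 2)
```

Because each token mixes only k states, the cost of the mixing step grows with k rather than with the number of candidates, which is consistent with the abstract's claim of negligible overhead. Exploration would apply during training only; setting `epsilon=0.0` gives purely greedy top-k selection.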
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Model Reduction and Neural Networks · Cell Image Analysis Techniques
