Uni-MoE-2.0-Omni: Scaling Language-Centric Omnimodal Large Model with Advanced MoE, Training and Data

Yunxin Li; Xinyu Chen; Shenyuan Jiang; Haoyuan Shi; Zhenyu Liu; Xuanyu Zhang; Nanhao Deng; Zhenran Xu; Yicheng Ma; Meishan Zhang; Baotian Hu; Min Zhang

arXiv:2511.12609 · cs.CL · November 25, 2025

Open Access 4 Models

TL;DR

Uni-MoE-2.0-Omni is a fully open-source omnimodal large model that advances multimodal understanding and generation across language, image, and speech through a dynamic-capacity MoE design, a progressive training strategy, and a multimodal data matching technique.

Contribution

The paper introduces a dynamic-capacity MoE framework, a progressive training strategy enhanced with iterative reinforcement, and a curated multimodal data matching technique, which together enable efficient and capable omnimodal large model development.

Findings

01. Achieves state-of-the-art or competitive results on 85 benchmarks.

02. Surpasses Qwen2.5-Omni on over 50 benchmarks.

03. Improves video understanding, omnimodality, and audiovisual reasoning.

Abstract

We present Uni-MoE 2.0 from the Lychee family. As a fully open-source omnimodal large model (OLM), it substantially advances Lychee's Uni-MoE series in language-centric multimodal understanding, reasoning, and generation. Starting from a dense LLM, we build Uni-MoE-2.0-Omni from scratch through three core contributions: a dynamic-capacity Mixture-of-Experts (MoE) design, a progressive training strategy enhanced with an iterative reinforcement strategy, and a carefully curated multimodal data matching technique. The model is capable of omnimodal understanding and can generate images, text, and speech. Architecturally, our new MoE framework balances computational efficiency and capability for 10 cross-modal inputs using shared, routed, and null experts, while our Omni-Modality 3D RoPE ensures spatio-temporal cross-modality alignment in the self-attention layer. For training, following…
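To make the idea of shared, routed, and null experts concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the function name `moe_forward`, the use of single linear maps as experts, the top-k softmax router, and the convention that null experts contribute zero output are all assumptions made for illustration. The point it shows is how routing a token to a null slot lets it skip routed-expert computation, so the number of active experts varies per token.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def moe_forward(x, shared_w, routed_ws, router_w, top_k):
    """Hypothetical MoE layer with shared, routed, and null experts.

    x:         (tokens, d) float array of token states
    shared_w:  (d, d) weights of the always-on shared expert
    routed_ws: (n_routed, d, d) weights of the routed experts
    router_w:  (d, n_routed + n_null) router weights; columns past
               n_routed are null slots (an assumption of this sketch)
    Returns the layer output and the count of routed experts used per token.
    """
    n_routed = routed_ws.shape[0]
    probs = softmax(x @ router_w)                      # (tokens, n_routed + n_null)
    top_idx = np.argsort(-probs, axis=-1)[:, :top_k]   # top-k slots per token

    out = x @ shared_w                                 # shared expert sees every token
    active = np.zeros(x.shape[0], dtype=int)
    for t in range(x.shape[0]):
        for e in top_idx[t]:
            if e >= n_routed:
                continue                               # null expert: no compute, no output
            out[t] += probs[t, e] * (x[t] @ routed_ws[e])
            active[t] += 1
    return out, active


rng = np.random.default_rng(0)
d, n_routed, n_null, top_k = 4, 2, 2, 2
x = rng.normal(size=(3, d))
shared_w = rng.normal(size=(d, d))
routed_ws = rng.normal(size=(n_routed, d, d))
router_w = rng.normal(size=(d, n_routed + n_null))
y, active = moe_forward(x, shared_w, routed_ws, router_w, top_k)
print(y.shape, active)
```

In this sketch a token whose top-k slots are all null is processed by the shared expert alone, which is one simple way to read "dynamic capacity"; the paper's actual routing rule and expert structure may differ.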

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Generative Adversarial Networks and Image Synthesis