Ming-Flash-Omni: A Sparse, Unified Architecture for Multimodal Perception and Generation
Inclusion AI: Bowen Ma, Cheng Zou, ChengKun Du, Canxiang Yan, Chunxiang Jin, Chunjie Shen, Chenyu Lian, Chengxiang Fan, Dandan Zheng, Fudong Wang, Furong Xu, Guangming Yao, Haohao Liu, Han Peng, Jun Zhou, Junluan Xia, Jingdong Chen, Jianing Li, Jianxin Sun, Jianjiang Zhu

TL;DR
Ming-Flash-Omni is a sparse Mixture-of-Experts multimodal model that unifies understanding and generation across vision, speech, and language, with substantial improvements over its predecessor, Ming-Omni, in both.
Contribution
It introduces a sparse Mixture-of-Experts architecture with 100 billion total parameters, of which only 6.1 billion are active per token, enabling scalable, unified perception and generation across vision, speech, and language.
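To make the "total versus active parameters" distinction concrete, here is a minimal sketch of top-k expert routing, the general mechanism behind sparse Mixture-of-Experts layers. It is not the Ming-Flash-Omni implementation: the function names, the toy experts, and the expert count and `k` used in the demo are assumptions chosen for illustration, since the summary above gives only the 100 billion total and 6.1 billion active figures.

```python
import math

# Figures taken from the summary above (in billions of parameters).
TOTAL_PARAMS_B = 100.0
ACTIVE_PARAMS_B = 6.1


def active_fraction(total_b, active_b):
    """Share of the model's parameters used for a single token."""
    return active_b / total_b


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k):
    """Pick the k highest-scoring experts for one token.

    Returns (expert_index, weight) pairs, best first, with the weights
    renormalised so they sum to 1 over the selected experts.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_forward(x, experts, router_logits, k):
    """Weighted sum of the selected experts' outputs; other experts never run."""
    return sum(w * experts[i](x) for i, w in route_top_k(router_logits, k))


if __name__ == "__main__":
    # Hypothetical toy layer: 4 scalar "experts", 2 active per token.
    toy_experts = [lambda x, scale=i + 1: scale * x for i in range(4)]
    toy_logits = [0.0, 2.0, 1.0, -1.0]
    print(route_top_k(toy_logits, k=2))
    print(moe_forward(1.0, toy_experts, toy_logits, k=2))
    print(f"{active_fraction(TOTAL_PARAMS_B, ACTIVE_PARAMS_B):.1%} of parameters active per token")
```

Only the selected experts are evaluated for a given token, which is why per-token compute tracks the active parameter count rather than the total. In real models the active count also includes parameters every token uses (attention, embeddings), so the active fraction is generally not just `k` divided by the number of experts.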
Findings
- Achieves overall scores on vision-language understanding benchmarks on par with Gemini 2.5 Pro
- Excels in speech recognition and joint speech-sound-music generation
- Demonstrates advanced generative semantic segmentation and in-image text rendering
Abstract
We propose Ming-Flash-Omni, an upgraded version of Ming-Omni, built upon a sparser Mixture-of-Experts (MoE) variant of Ling-Flash-2.0 with 100 billion total parameters, of which only 6.1 billion are active per token. This architecture enables highly efficient scaling (dramatically improving computational efficiency while significantly expanding model capacity) and empowers stronger unified multimodal intelligence across vision, speech, and language, representing a key step toward Artificial General Intelligence (AGI). Compared to its predecessor, the upgraded version exhibits substantial improvements across multimodal understanding and generation. Notably, it achieves strong performance on vision-language understanding benchmarks, with overall scores on par with Gemini 2.5 Pro, and enables seamless switching among multimodal tasks in multi-turn interactions. In speech, it achieves…
Code & Models
- 🤗 inclusionAI/Ming-flash-omni-Preview (model) · 461 downloads · 71 likes
- 🤗 fantos/Ming-flash-omni-Preview (model) · 271 downloads
- 🤗 inclusionAI/Ming-flash-omni-2.0 (model) · 9.5k downloads · 256 likes
- 🤗 andrewheins55/Ming-flash-omni-2.1 (model) · 175 downloads · 1 like
- 🤗 Jonathan1909/Ming-flash-omni-2.0 (model) · 505 downloads
- 🤗 servantofares/Ming-flash-omni-2.0 (model) · 43 downloads
