Ming-Flash-Omni: A Sparse, Unified Architecture for Multimodal Perception and Generation

Inclusion AI: Bowen Ma; Cheng Zou; ChengKun Du; Canxiang Yan; Chunxiang Jin; Chunjie Shen; Chenyu Lian; Chengxiang Fan; Dandan Zheng; Fudong Wang; Furong Xu; Guangming Yao; Haohao Liu; Han Peng; Jun Zhou; Junluan Xia; Jingdong Chen; Jianing Li; Jianxin Sun; Jianjiang Zhu; Jianping Jiang; Jinpeng Ou; Jun Peng; Jin Peng; Kaixiang Ji; Li Tang; Libin Wang; Lixiang Ru; Longhua Tan; Lu Ma; Lan Wang; Mochen Bai; Minghong Cai; Mingxue Yang; Ning Gao; Qingpei Guo; Qinglong Zhang; Qiang Xu; Qin Zhao; Rui Liu; Ruijie Xiong; Ruobing Zheng; Sirui Gao; Shaoxiong Lin; Tao Zhang; Tianqi Li; Tinghao Liu; Tongli Wang; Taoye Huang; Weilong Chai; Xiaomei Wang; Xiaolong Wang; Xiaojian Liu; Xiao Lu; Xiaoyu Li; Xingning Dong; Xuzheng Yu; Xuezhi Wang; Yi Yuan; Yuting Gao; Yuting Xiao; Yunxiao Sun; Yipeng Chen; Yifan Mao; Yifei Wu; Yongjie Lyu; Yingying Zhang; YuQian Li; Ziping Ma; Zhiqiang Fang; Zhihao Qiu; Ziyuan Huang; Zizheng Yang; Zhengyu He

arXiv:2510.24821 · cs.CV · March 27, 2026

TL;DR

Ming-Flash-Omni is a sparse Mixture-of-Experts multimodal model with 100 billion total parameters, of which 6.1 billion are active per token. It unifies understanding and generation across vision, speech, and language, and improves on its predecessor, Ming-Omni, in both.

Contribution

It introduces a sparse Mixture-of-Experts architecture with 100 billion total parameters, of which 6.1 billion are active per token, enabling unified perception and generation across vision, speech, and language.
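
To make the sparsity concrete, the sketch below works out the share of parameters used per token from the figures in the abstract (100 billion total, 6.1 billion active) and shows a generic top-k expert router. The routing function is an illustrative assumption: this page does not describe the model's actual gating scheme, its expert count, or its value of k, and the names `active_fraction` and `top_k_route` are hypothetical.

```python
import math

TOTAL_PARAMS_B = 100.0  # total parameters in billions, as stated in the abstract
ACTIVE_PARAMS_B = 6.1   # parameters active per token in billions, as stated in the abstract


def active_fraction(active_b: float, total_b: float) -> float:
    """Fraction of the model's parameters used to process a single token."""
    return active_b / total_b


def top_k_route(router_logits: list[float], k: int) -> dict[int, float]:
    """Generic top-k gating (an assumption, not the paper's documented router).

    Keeps the k experts with the highest router scores and returns their
    mixing weights, computed as a softmax over the selected scores only.
    """
    if not 0 < k <= len(router_logits):
        raise ValueError("k must be between 1 and the number of experts")
    # Indices of the k highest-scoring experts.
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(router_logits[i] for i in top)
    exps = {i: math.exp(router_logits[i] - peak) for i in top}
    norm = sum(exps.values())
    return {i: e / norm for i, e in exps.items()}


if __name__ == "__main__":
    share = active_fraction(ACTIVE_PARAMS_B, TOTAL_PARAMS_B)
    print(f"share of parameters active per token: {share:.3f}")
    # Four hypothetical experts, two selected per token.
    print(top_k_route([0.1, 2.0, -1.0, 1.5], k=2))
```

Only the selected experts run for a given token, which is why per-token compute tracks the active parameter count rather than the total.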

Findings

01. Achieves vision-language understanding comparable to Gemini 2.5 Pro

02. Excels in speech recognition and joint speech-sound-music generation

03. Demonstrates advanced generative semantic segmentation and in-image text rendering

Abstract

We propose Ming-Flash-Omni, an upgraded version of Ming-Omni, built upon a sparser Mixture-of-Experts (MoE) variant of Ling-Flash-2.0 with 100 billion total parameters, of which only 6.1 billion are active per token. This architecture enables highly efficient scaling (dramatically improving computational efficiency while significantly expanding model capacity) and empowers stronger unified multimodal intelligence across vision, speech, and language, representing a key step toward Artificial General Intelligence (AGI). Compared to its predecessor, the upgraded version exhibits substantial improvements across multimodal understanding and generation. Notably, it achieves strong performance on vision-language understanding benchmarks, with overall scores on par with Gemini 2.5 Pro, and enables seamless switching among multimodal tasks in multi-turn interactions. In speech, it achieves…
