Ming-Omni: A Unified Multimodal Model for Perception and Generation

Inclusion AI; Biao Gong; Cheng Zou; Chuanyang Zheng; Chunluan Zhou; Canxiang Yan; Chunxiang Jin; Chunjie Shen; Dandan Zheng; Fudong Wang; Furong Xu; GuangMing Yao; Jun Zhou; Jingdong Chen; Jianxin Sun; Jiajia Liu; Jianjiang Zhu; Jun Peng; Kaixiang Ji; Kaiyou Song; Kaimeng Ren; Libin Wang; Lixiang Ru; Lele Xie; Longhua Tan; Lyuxin Xue; Lan Wang; Mochen Bai; Ning Gao; Pei Chen; Qingpei Guo; Qinglong Zhang; Qiang Xu; Rui Liu; Ruijie Xiong; Sirui Gao; Tinghao Liu; Taisong Li; Weilong Chai; Xinyu Xiao; Xiaomei Wang; Xiaoxue Chen; Xiao Lu; Xiaoyu Li; Xingning Dong; Xuzheng Yu; Yi Yuan; Yuting Gao; Yunxiao Sun; Yipeng Chen; Yifei Wu; Yongjie Lyu; Ziping Ma; Zipeng Feng; Zhijiang Fang; Zhihao Qiu; Ziyuan Huang; Zhengyu He

arXiv:2506.09344 · cs.AI · June 12, 2025

Open Access · 1 Repo · 9 Models

TL;DR

Ming-Omni is a unified multimodal model that processes images, text, audio, and video and generates speech and images, so a single model covers tasks that would otherwise require several specialized models.

Contribution

It introduces a unified architecture in which Ling, an MoE model, is equipped with newly proposed modality-specific routers, and it adds audio and image generation, which conventional multimodal models do not support.

Findings

01. Achieves strong performance on both multimodal perception and generation tasks.

02. Supports high-quality audio and image generation.

03. Matches GPT-4o in the range of supported modalities, setting a new reference point for open-source models.

Abstract

We propose Ming-Omni, a unified multimodal model capable of processing images, text, audio, and video, while demonstrating strong proficiency in both speech and image generation. Ming-Omni employs dedicated encoders to extract tokens from different modalities, which are then processed by Ling, an MoE architecture equipped with newly proposed modality-specific routers. This design enables a single model to efficiently process and fuse multimodal inputs within a unified framework, thereby facilitating diverse tasks without requiring separate models, task-specific fine-tuning, or structural redesign. Importantly, Ming-Omni extends beyond conventional multimodal models by supporting audio and image generation. This is achieved through the integration of an advanced audio decoder for natural-sounding speech and Ming-Lite-Uni for high-quality image generation, which also allow the model to…
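The routing idea in the abstract (tokens from dedicated encoders passing through an MoE whose routers are modality-specific) can be illustrated with a toy sketch. This is not the Ming-Omni or Ling implementation: the class name `ModalityRoutedMoE`, the linear experts, the softmax gate, and the top-2 selection are all assumptions chosen to keep the example small and dependency-free. The only point it shows is that the experts are shared while each modality scores them with its own router.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a row-major matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


class ModalityRoutedMoE:
    """Toy MoE layer (hypothetical): shared experts, one router per modality."""

    def __init__(self, dim, num_experts, modalities, top_k=2, seed=0):
        rng = random.Random(seed)

        def rand_matrix(rows, cols):
            return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

        self.top_k = top_k
        # Experts are shared by every modality (assumed here to be linear maps).
        self.experts = [rand_matrix(dim, dim) for _ in range(num_experts)]
        # Each modality owns a separate (num_experts x dim) scoring matrix.
        self.routers = {m: rand_matrix(num_experts, dim) for m in modalities}

    def route(self, token, modality):
        """Return [(expert_index, weight)] for the top-k experts; weights sum to 1."""
        probs = softmax(matvec(self.routers[modality], token))
        ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
        top = ranked[: self.top_k]
        norm = sum(probs[i] for i in top)
        return [(i, probs[i] / norm) for i in top]

    def forward(self, token, modality):
        """Mix the outputs of the selected experts using the routing weights."""
        out = [0.0] * len(token)
        for idx, weight in self.route(token, modality):
            expert_out = matvec(self.experts[idx], token)
            out = [o + weight * e for o, e in zip(out, expert_out)]
        return out


modalities = ["text", "image", "audio", "video"]
layer = ModalityRoutedMoE(dim=4, num_experts=4, modalities=modalities)
token = [0.5, -0.2, 0.1, 0.9]  # stand-in for an encoder output token
for modality in modalities:
    print(modality, layer.route(token, modality))
```

Because the same token vector is scored by a different router for each modality, the selected experts and their weights can differ per modality even though the expert pool is shared.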

Code & Models

Repositories

inclusionai/ming
PyTorch · Official

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Face recognition and analysis

Methods: Mixture of Experts · Attentive Walk-Aggregating Graph Neural Network