Ming-Omni: A Unified Multimodal Model for Perception and Generation
Inclusion AI, Biao Gong, Cheng Zou, Chuanyang Zheng, Chunluan Zhou, Canxiang Yan, Chunxiang Jin, Chunjie Shen, Dandan Zheng, Fudong Wang, Furong Xu, GuangMing Yao, Jun Zhou, Jingdong Chen, Jianxin Sun, Jiajia Liu, Jianjiang Zhu, Jun Peng, Kaixiang Ji, Kaiyou Song, Kaimeng Ren

TL;DR
Ming-Omni is a comprehensive multimodal model that unifies processing and generation of images, text, audio, and video, enabling versatile tasks without multiple specialized models.
Contribution
It introduces a novel unified architecture with modality-specific routers and supports audio and image generation, extending beyond existing multimodal models.
Findings
Performs strongly on both multimodal perception and generation tasks.
Generates natural-sounding speech and high-quality images in addition to processing images, text, audio, and video.
Matches GPT-4o in the range of modalities it supports, which is new among open-source models.
Abstract
We propose Ming-Omni, a unified multimodal model capable of processing images, text, audio, and video, while demonstrating strong proficiency in both speech and image generation. Ming-Omni employs dedicated encoders to extract tokens from different modalities, which are then processed by Ling, an MoE architecture equipped with newly proposed modality-specific routers. This design enables a single model to efficiently process and fuse multimodal inputs within a unified framework, thereby facilitating diverse tasks without requiring separate models, task-specific fine-tuning, or structural redesign. Importantly, Ming-Omni extends beyond conventional multimodal models by supporting audio and image generation. This is achieved through the integration of an advanced audio decoder for natural-sounding speech and Ming-Lite-Uni for high-quality image generation, which also allow the model to…
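The abstract's idea of modality-specific routers can be illustrated with a minimal sketch. It is not the Ming-Omni or Ling implementation: the class name `ModalityRoutedMoE`, the top-k selection, the softmax gating, and the use of a single linear map per expert are all assumptions made for illustration. The only idea taken from the abstract is that tokens from different modalities pass through different routers on their way into one shared MoE.

```python
import math
import random


def matvec(matrix, vec):
    """Multiply a row-major matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


class ModalityRoutedMoE:
    """Hypothetical MoE layer: shared experts, one router per modality."""

    def __init__(self, dim, num_experts, modalities, top_k=2, seed=0):
        rng = random.Random(seed)
        self.top_k = top_k
        # Experts are shared by all modalities (assumed: one linear map each).
        self.experts = [random_matrix(dim, dim, rng) for _ in range(num_experts)]
        # Each modality gets its own router mapping a token to expert logits.
        self.routers = {m: random_matrix(num_experts, dim, rng) for m in modalities}

    def route(self, token, modality):
        """Return [(expert_id, gate_weight), ...] for the top-k experts."""
        logits = matvec(self.routers[modality], token)
        top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        top = top[: self.top_k]
        gates = softmax([logits[i] for i in top])
        return list(zip(top, gates))

    def forward(self, token, modality):
        """Gate-weighted sum of the selected experts' outputs."""
        out = [0.0] * len(token)
        for expert_id, gate in self.route(token, modality):
            expert_out = matvec(self.experts[expert_id], token)
            out = [o + gate * e for o, e in zip(out, expert_out)]
        return out


if __name__ == "__main__":
    moe = ModalityRoutedMoE(dim=4, num_experts=4,
                            modalities=["text", "image", "audio", "video"])
    token = [0.5, -1.0, 0.25, 2.0]
    # The same token vector may be sent to different experts per modality.
    for modality in ("text", "image"):
        print(modality, moe.route(token, modality))
```

In this sketch the experts stay shared while only the routing decision depends on the modality, which is one plausible way a single model could fuse heterogeneous inputs without a separate network per modality.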
Code & Models
- 🤗 inclusionAI/Ming-Lite-Omni (model) · 74 dl · ♡ 198
- 🤗 inclusionAI/Ming-Lite-Omni-1.5 (model) · 157 dl · ♡ 85
- 🤗 wikeeyang/Ming-Lite-Omni-v1.5-NF4 (model) · 6 dl · ♡ 3
- 🤗 inclusionAI/Ming-flash-omni-Preview (model) · 461 dl · ♡ 71
- 🤗 fantos/Ming-flash-omni-Preview (model) · 271 dl
- 🤗 inclusionAI/Ming-flash-omni-2.0 (model) · 9.5k dl · ♡ 256
- 🤗 andrewheins55/Ming-flash-omni-2.1 (model) · 175 dl · ♡ 1
- 🤗 Jonathan1909/Ming-flash-omni-2.0 (model) · 505 dl
- 🤗 servantofares/Ming-flash-omni-2.0 (model) · 43 dl
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Face recognition and analysis
Methods: Mixture of Experts · Attentive Walk-Aggregating Graph Neural Network
