M2-omni: Advancing Omni-MLLM for Comprehensive Modality Support with Competitive Performance

Qingpei Guo; Kaiyou Song; Zipeng Feng; Ziping Ma; Qinglong Zhang; Sirui Gao; Xuzheng Yu; Yunxiao Sun; Tai-Wei Chang; Jingdong Chen; Ming Yang; Jun Zhou

arXiv:2502.18778 · cs.LG · April 8, 2025

Open Access

TL;DR

M2-omni is an open-source omni-MLLM that supports multiple modalities with performance competitive with GPT-4o. It uses a unified modeling framework and novel training strategies for balanced, comprehensive cross-modal understanding and generation.

Contribution

The paper introduces M2-omni, a versatile omni-MLLM with a unified multimodal framework and innovative training strategies to handle data disparities and enhance cross-modal capabilities.

Findings

01. Achieves performance comparable to GPT-4o across modalities.

02. Supports arbitrary combinations of audio, video, image, and text inputs.

03. Maintains strong language understanding while handling multimodal tasks.

Abstract

We present M2-omni, a cutting-edge, open-source omni-MLLM that achieves performance competitive with GPT-4o. M2-omni employs a unified multimodal sequence modeling framework, which empowers large language models (LLMs) to acquire comprehensive cross-modal understanding and generation capabilities. Specifically, M2-omni can process arbitrary combinations of audio, video, image, and text modalities as input and generate multimodal sequences that interleave audio, image, or text outputs, thereby enabling an advanced and interactive real-time experience. Training such an omni-MLLM is challenged by significant disparities in data quantity and convergence rates across modalities. To address these challenges, we propose a step balance strategy during pre-training to handle the quantity disparities in modality-specific data. Additionally, a dynamically adaptive balance strategy is…

Taxonomy

Topics: Multimodal Machine Learning Applications · Speech Recognition and Synthesis · Topic Modeling