MinMo: A Multimodal Large Language Model for Seamless Voice Interaction

Qian Chen; Yafeng Chen; Yanni Chen; Mengzhe Chen; Yingda Chen; Chong Deng; Zhihao Du; Ruize Gao; Changfeng Gao; Zhifu Gao; Yabin Li; Xiang Lv; Jiaqing Liu; Haoneng Luo; Bin Ma; Chongjia Ni; Xian Shi; Jialong Tang; Hui Wang; Hao Wang; Wen Wang; Yuxuan Wang; Yunlan Xu; Fan Yu; Zhijie Yan; Yexin Yang; Baosong Yang; Xian Yang; Guanrou Yang; Tianyu Zhao; Qinglin Zhang; Shiliang Zhang; Nan Zhao; Pei Zhang; Chong Zhang; Jinren Zhou

arXiv:2501.06282 · cs.CL · January 14, 2025

TL;DR

MinMo is a large multimodal language model designed for seamless, real-time voice interactions, achieving state-of-the-art performance and supporting full-duplex conversations with nuanced speech control.

Contribution

Introduces MinMo, a multimodal LLM with approximately 8B parameters, trained through multi-stage alignment on extensive speech data to enable advanced voice interaction capabilities.

Findings

01. Achieves state-of-the-art results on voice comprehension and generation benchmarks.

02. Supports full-duplex (simultaneous two-way) conversations with low latency.

03. Outperforms prior models in voice quality and instruction following.

Abstract

Recent advancements in large language models (LLMs) and multimodal speech-text models have laid the groundwork for seamless voice interactions, enabling real-time, natural, and human-like conversations. Previous models for voice interactions are categorized as native and aligned. Native models integrate speech and text processing in one framework but struggle with issues like differing sequence lengths and insufficient pre-training. Aligned models maintain text LLM capabilities but are often limited by small datasets and a narrow focus on speech tasks. In this work, we introduce MinMo, a Multimodal Large Language Model with approximately 8B parameters for seamless voice interaction. We address the main limitations of prior aligned multimodal models. We train MinMo through multiple stages of speech-to-text alignment, text-to-speech alignment, speech-to-speech alignment, and duplex…
