Ming-UniAudio: Speech LLM for Joint Understanding, Generation and Editing with Unified Representation

Canxiang Yan; Chunxiang Jin; Dawei Huang; Haibing Yu; Han Peng; Hui Zhan; Jie Gao; Jing Peng; Jingdong Chen; Jun Zhou; Kaimeng Ren; Ming Yang; Mingxue Yang; Qiang Xu; Qin Zhao; Ruijie Xiong; Shaoxiong Lin; Xuezhi Wang; Yi Yuan; Yifei Wu; Yongjie Lyu; Zhengyu He; Zhihao Qiu; Zhiqiang Fang; Ziyuan Huang

arXiv:2511.05516 · cs.CL · November 11, 2025

TL;DR

Ming-UniAudio introduces a unified speech model and tokenizer that effectively combines understanding, generation, and editing capabilities, achieving state-of-the-art results and enabling natural language-guided speech editing.

Contribution

The paper presents MingTok-Audio, the first continuous speech tokenizer integrating semantic and acoustic features, and Ming-UniAudio, a unified model for speech understanding, generation, and editing, along with a new benchmark for speech editing.

Findings

1. Achieves SOTA on 8 out of 12 metrics on ContextASR.
2. Attains Seed-TTS-WER of 0.95 for Chinese voice cloning.
3. Enables universal, free-form speech editing guided by natural language.
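
For readers unfamiliar with the WER figure quoted above, the following is a minimal sketch of how a word (or character) error rate is usually computed: the edit distance between a reference transcript and a recognized transcript, divided by the reference length. It is not the benchmark's evaluation script. The function names `edit_distance` and `error_rate` are hypothetical, and two things are assumptions rather than facts from this page: that the reported 0.95 is a percentage, and that Chinese text is scored per character rather than per space-separated word.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (lists or strings)."""
    # prev[j] holds the distance between the first i-1 ref tokens
    # and the first j hyp tokens.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Error rate in percent: 100 * edit distance / reference length."""
    if not ref_tokens:
        raise ValueError("reference must be non-empty")
    return 100.0 * edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


# Word-level tokens for English (illustrative sentences, not benchmark data).
wer_en = error_rate("the cat sat on the mat".split(),
                    "the cat sat on mat".split())

# Character-level tokens for Chinese (assumed convention, see lead-in).
cer_zh = error_rate(list("今天天气很好"), list("今天天汽很好"))
```

In a voice-cloning evaluation of this kind, the hypothesis would typically come from running a speech recognizer on the synthesized audio, so a lower value suggests the generated speech is more intelligible.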

Abstract

Existing speech models face competing requirements that understanding and generation tasks place on token representations. This representational mismatch prevents speech language models from performing instruction-based free-form editing. To address this challenge, we introduce a framework that unifies speech understanding, generation, and editing. At its core is MingTok-Audio, a unified continuous speech tokenizer and the first continuous tokenizer to effectively integrate semantic and acoustic features, which makes it suitable for both understanding and generation tasks. Building on this tokenizer, we develop the speech language model Ming-UniAudio, which balances generation and understanding capabilities. Ming-UniAudio sets new state-of-the-art (SOTA) records on 8 out of 12 metrics on the ContextASR benchmark. Notably,…

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Music Technology and Sound Studies