Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction
Ailin Huang, Boyong Wu, Bruce Wang, Chao Yan, Chen Hu, Chengli Feng, Fei Tian, Feiyu Shen, Jingbei Li, Mingrui Chen, Peng Liu, Ruihang Miao, Wang You, Xi Chen, Xuerui Yang, Yechang Huang, Yuxiang Zhang, Zheng Gong, Zixin Zhang, Hongyu Zhou, Jianjian Sun, Brian Li

TL;DR
Step-Audio introduces a comprehensive open-source speech understanding and generation model with advanced control, voice cloning, and task management capabilities, setting new benchmarks in open-source speech AI.
Contribution
It presents the first production-ready open-source unified speech-text model, a generative speech data engine, and an instruction-driven control system, advancing open-source multimodal speech AI.
Findings
Achieves state-of-the-art performance in human evaluations.
Improves average performance on open-source benchmarks by 9.3%.
Demonstrates effective dynamic control and complex task management.
Abstract
Real-time speech interaction, serving as a fundamental interface for human-machine collaboration, holds immense potential. However, current open-source models face limitations such as high costs in voice data collection, weakness in dynamic control, and limited intelligence. To address these challenges, this paper introduces Step-Audio, the first production-ready open-source solution. Key contributions include: 1) a 130B-parameter unified speech-text multi-modal model that achieves unified understanding and generation, with the Step-Audio-Chat version open-sourced; 2) a generative speech data engine that establishes an affordable voice cloning framework and produces the open-sourced lightweight Step-Audio-TTS-3B model through distillation; 3) an instruction-driven fine control system enabling dynamic adjustments across dialects, emotions, singing, and RAP; 4) an enhanced cognitive…
Taxonomy
Topics: Speech and dialogue systems
Methods: LLaMA
