Step-Audio 2 Technical Report
Boyong Wu, Chao Yan, Chen Hu, Cheng Yi, Chengli Feng, Fei Tian, Feiyu Shen, Gang Yu, Haoyang Zhang, Jingbei Li, Mingrui Chen, Peng Liu, Wang You, Xiangyu Tony Zhang, Xingyuan Li, Xuerui Yang, Yayue Deng, Yechang Huang, Yuxin Li, Yuxin Zhang, Zhao You, Brian Li, Changyi Wan

TL;DR
Step-Audio 2 is an advanced multi-modal large language model that integrates audio encoding, reasoning, and retrieval techniques to excel in speech recognition and conversational audio understanding.
Contribution
It introduces a novel end-to-end model combining a latent audio encoder, reinforcement learning, and retrieval-augmented generation for improved audio and speech understanding.
Findings
Achieves state-of-the-art performance on audio understanding benchmarks.
Effectively incorporates paralinguistic features like emotions and speaking styles.
Demonstrates robust conversational capabilities across diverse scenarios.
Abstract
This paper presents Step-Audio 2, an end-to-end multi-modal large language model designed for industry-strength audio understanding and speech conversation. By integrating a latent audio encoder and reasoning-centric reinforcement learning (RL), Step-Audio 2 achieves promising performance in automatic speech recognition (ASR) and audio understanding. To facilitate genuine end-to-end speech conversation, Step-Audio 2 incorporates the generation of discrete audio tokens into language modeling, significantly enhancing its responsiveness to paralinguistic information such as speaking styles and emotions. To effectively leverage the rich textual and acoustic knowledge in real-world data, Step-Audio 2 integrates retrieval-augmented generation (RAG) and is able to call external tools such as web search to mitigate hallucination and audio search to switch timbres. Trained on millions of hours…
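The tool-calling behaviour mentioned in the abstract (web search to ground answers, audio search to switch timbres) can be pictured with a small dispatch sketch. This is only an illustration under stated assumptions, not the authors' implementation: the `ToolCall` structure, the tool names `web_search` and `audio_search`, the placeholder return strings, and the `build_prompt` helper are all hypothetical and do not come from the report.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class ToolCall:
    """Hypothetical structured request emitted by the model (assumption)."""
    name: str   # which tool to invoke, e.g. "web_search"
    query: str  # free-text argument passed to the tool


def web_search(query: str) -> str:
    # Stand-in for a real search backend; returns placeholder text evidence.
    return f"[text evidence for: {query}]"


def audio_search(query: str) -> str:
    # Stand-in for a voice-library lookup; returns a placeholder timbre reference.
    return f"[timbre reference for: {query}]"


# Registry mapping hypothetical tool names to their handlers.
TOOLS: Dict[str, Callable[[str], str]] = {
    "web_search": web_search,
    "audio_search": audio_search,
}


def dispatch(call: ToolCall) -> str:
    """Route a tool call to its handler and return the retrieved context."""
    if call.name not in TOOLS:
        raise ValueError(f"unknown tool: {call.name}")
    return TOOLS[call.name](call.query)


def build_prompt(user_turn: str, call: ToolCall) -> str:
    """Prepend the retrieved context to the user turn before generation."""
    return f"{dispatch(call)}\n{user_turn}"
```

In this sketch the retrieved context is simply prepended to the user turn; how Step-Audio 2 actually conditions on retrieved text or audio is not specified in the excerpt above.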
Code & Models
- 🤗 stepfun-ai/Step-Audio-2-mini-Base (model) · 34 downloads · ♡ 25
- 🤗 stepfun-ai/Step-Audio-2-mini (model) · 2.0k downloads · ♡ 254
- 🤗 stepfun-ai/Step-Audio-2-mini-Think (model) · 13 downloads · ♡ 17
- 🤗 chaitnya26/Step-Audio-2-mini-fork (model) · 9 downloads · ♡ 1
- 🤗 y-ren16/MCLP-CS (model) · 1 download
- 🤗 y-ren16/MCLP-RP-TTS (model) · 1 download
- 🤗 y-ren16/MCLP-RP-TTS-GRPO (model) · 1 download
Taxonomy
Topics: Emotion and Mood Recognition · Music and Audio Processing · Speech Recognition and Synthesis
