Step-Audio-R1 Technical Report

Fei Tian; Xiangyu Tony Zhang; Yuxin Zhang; Haoyang Zhang; Yuxin Li; Daijiao Liu; Yayue Deng; Donghang Wu; Jun Chen; Liang Zhao; Chengyuan Yao; Hexin Liu; Eng Siong Chng; Xuerui Yang; Xiangyu Zhang; Daxin Jiang; Gang Yu

arXiv:2511.15848·cs.AI·November 27, 2025

TL;DR

This paper introduces Step-Audio-R1, the first audio reasoning model that effectively grounds reasoning in acoustic features, surpassing previous models and demonstrating that reasoning capabilities can transfer across modalities when properly anchored.

Contribution

The paper presents the novel Step-Audio-R1 model and the Modality-Grounded Reasoning Distillation framework, enabling genuine audio reasoning grounded in acoustic features.

Findings

1. Outperforms Gemini 2.5 Pro on audio reasoning tasks.
2. Achieves performance comparable to Gemini 3 Pro.
3. Demonstrates that reasoning transfers across modalities.

Abstract

Recent advances in reasoning models have demonstrated remarkable success in text and vision domains through extended chain-of-thought deliberation. However, a perplexing phenomenon persists in audio language models: they consistently perform better with minimal or no reasoning, raising a fundamental question: can audio intelligence truly benefit from deliberate thinking? We introduce Step-Audio-R1, the first audio reasoning model that successfully unlocks reasoning capabilities in the audio domain. Through our proposed Modality-Grounded Reasoning Distillation (MGRD) framework, Step-Audio-R1 learns to generate audio-relevant reasoning chains that genuinely ground themselves in acoustic features rather than hallucinating disconnected deliberations. Our model exhibits strong audio reasoning capabilities, surpassing Gemini 2.5 Pro and achieving performance comparable to the…

Taxonomy

Topics: Multimodal Machine Learning Applications · Speech and dialogue systems · Explainable Artificial Intelligence (XAI)