InstructAudio: Unified speech and music generation with natural language instruction

Chunyu Qiang; Kang Yin; Xiaopeng Wang; Yuzhe Liang; Jiahui Zhao; Ruibo Fu; Tianrui Wang; Cheng Gong; Chen Zhang; Longbiao Wang; Jianwu Dang

arXiv:2511.18487·eess.AS·November 25, 2025

Open Access

TL;DR

InstructAudio is a pioneering unified framework that enables natural language instruction-based control over speech and music generation, supporting diverse attributes and multilingual capabilities, trained on extensive datasets for multi-task learning.

Contribution

It introduces the first unified model for instruction-controlled speech and music generation, leveraging joint diffusion transformer layers and standardized inputs.

Findings

01. Achieves superior performance on multiple metrics compared to mainstream models.

02. Supports expressive speech, music, and dialogue generation in English and Chinese.

03. Demonstrates effective multi-task learning and cross-modal alignment.

Abstract

Text-to-speech (TTS) and text-to-music (TTM) models face significant limitations in instruction-based control. TTS systems usually depend on reference audio for timbre, offer only limited text-level attribute control, and rarely support dialogue generation. TTM systems are constrained by input conditioning that depends on expert-knowledge annotations. The high heterogeneity of these control inputs makes them difficult to model jointly with speech synthesis. Although the two tasks share common acoustic modeling characteristics, they have long been developed independently, leaving unified modeling through natural language instructions an open challenge. We introduce InstructAudio, a unified framework that enables instruction-based control (via natural language descriptions) of acoustic attributes including timbre (gender, age), paralinguistic (emotion,…

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Music Technology and Sound Studies