InstructAudio: Unified speech and music generation with natural language instruction
Chunyu Qiang, Kang Yin, Xiaopeng Wang, Yuzhe Liang, Jiahui Zhao, Ruibo Fu, Tianrui Wang, Cheng Gong, Chen Zhang, Longbiao Wang, Jianwu Dang

TL;DR
InstructAudio is a unified framework that enables natural language instruction-based control over speech and music generation. It supports diverse acoustic attributes in English and Chinese and is trained on extensive datasets with multi-task learning.
Contribution
It introduces the first unified model for instruction-controlled speech and music generation, leveraging joint diffusion transformer layers and standardized inputs.
Findings
Achieves superior performance on multiple metrics compared to mainstream models.
Supports expressive speech, music, and dialogue generation in English and Chinese.
Demonstrates effective multi-task learning and cross-modal alignment.
Abstract
Text-to-speech (TTS) and text-to-music (TTM) models face significant limitations in instruction-based control. TTS systems usually depend on reference audio for timbre, offer only limited text-level attribute control, and rarely support dialogue generation. TTM systems are constrained by input conditioning requirements that depend on expert knowledge annotations. The high heterogeneity of these input control conditions makes them difficult to model jointly with speech synthesis. Despite sharing common acoustic modeling characteristics, these two tasks have long been developed independently, leaving open the challenge of achieving unified modeling through natural language instructions. We introduce InstructAudio, a unified framework that enables instruction-based (natural language descriptions) control of acoustic attributes including timbre (gender, age), paralinguistic (emotion,…
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Music Technology and Sound Studies
