UniSonate: A Unified Model for Speech, Music, and Sound Effect Generation with Text Instructions

Chunyu Qiang, Xiaopeng Wang, Kang Yin, Yuzhe Liang, Yuxin Guo, Teng Ma, Ziyu Zhang, Tianrui Wang, Cheng Gong, Yushen Chen, Ruibo Fu, Chen Zhang, Longbiao Wang, and Jianwu Dang

arXiv:2604.22209·eess.AS·April 27, 2026

TL;DR

UniSonate is a unified framework for generating speech, music, and sound effects from text instructions. It combines a novel dynamic token injection mechanism with curriculum learning to improve synthesis quality across the three audio types.

Contribution

The paper presents a novel unified flow-matching model that uses a dynamic token injection mechanism and multi-stage training to cover speech, music, and sound effect generation.

Findings

01 Achieves state-of-the-art results in instruction-based text-to-speech (TTS) and text-to-music (TTM) generation.

02 Maintains competitive fidelity in text-to-audio (TTA) generation.

03 Joint training across tasks improves structural coherence and prosody.

Abstract

Generative audio modeling has largely been fragmented into specialized tasks, text-to-speech (TTS), text-to-music (TTM), and text-to-audio (TTA), each operating under heterogeneous control paradigms. Unifying these modalities remains a fundamental challenge due to the intrinsic dissonance between structured semantic representations (speech/music) and unstructured acoustic textures (sound effects). In this paper, we introduce UniSonate, a unified flow-matching framework capable of synthesizing speech, music, and sound effects through a standardized, reference-free natural language instruction interface. To reconcile structural disparities, we propose a novel dynamic token injection mechanism that projects unstructured environmental sounds into a structured temporal latent space, enabling precise duration control within a phoneme-driven Multimodal Diffusion Transformer (MM-DiT). Coupled…
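The abstract names flow matching as the generative backbone but does not spell out the objective. As a rough orientation only, the sketch below shows a generic flow-matching setup with a straight-line path between a noise sample and a data sample. It is not UniSonate's implementation: the function names (interpolate, target_velocity, flow_matching_loss, euler_sample), the linear path, and the Euler sampler are all assumptions chosen for illustration, and the text conditioning, token injection, and MM-DiT network are omitted entirely.

```python
# Generic flow-matching sketch (assumed linear path; NOT the paper's code).
# Vectors are plain Python lists of floats to keep the example dependency-free.

def interpolate(x0, x1, t):
    """Point on the straight line from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Velocity of the straight-line path; constant in t for this path."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(predict_velocity, x0, x1, t):
    """Mean squared error between predicted and target velocity at time t.

    predict_velocity is a hypothetical stand-in for the trained network:
    any callable (x_t, t) -> list of floats with the same length as x_t.
    """
    x_t = interpolate(x0, x1, t)
    v_pred = predict_velocity(x_t, t)
    v_true = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_true)) / len(v_true)


def euler_sample(predict_velocity, x0, num_steps=10):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        v = predict_velocity(x, i * dt)
        x = [a + dt * b for a, b in zip(x, v)]
    return x
```

In a real system the callable passed as predict_velocity would be a neural network conditioned on the text instruction, and the loss would be averaged over random times and training pairs. With a perfect velocity predictor the loss is zero and Euler integration carries the noise sample onto the data sample.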

Code & Models

Repositories

https://qiangchunyu.github.io/UniSonate
