UniTTS: An end-to-end TTS system without decoupling of acoustic and semantic information

Rui Wang; Qianguo Sun; Tianrong Chen; Zhiyun Zeng; Junlong Wu; Jiaxing Zhang

arXiv:2505.17426 · cs.SD · May 26, 2025

TL;DR

UniTTS introduces an end-to-end TTS system that integrates comprehensive audio modeling without decoupling acoustic and semantic information, enabling flexible prompt-based speech synthesis and broad data utilization.

Contribution

The paper presents DistilCodec, a single-codebook audio codec distilled from a multi-codebook codec, and integrates multiple autoregressive tasks into UniTTS, broadening both the data the system can train on and the prompts it can handle.

Findings

01 Achieves near-100% utilization of a 32,768-code single-codebook audio codec.

02 Enables incorporation of unlabeled high-quality audio data during training.

03 Supports interleaved text and speech prompts while preserving language model capabilities.

Abstract

The emergence of multi-codebook neural audio codecs such as Residual Vector Quantization (RVQ) and Group Vector Quantization (GVQ) has significantly advanced Large-Language-Model (LLM) based Text-to-Speech (TTS) systems. These codecs are crucial in separating semantic and acoustic information while efficiently harnessing semantic priors. However, since semantic and acoustic information cannot be fully aligned, a significant drawback of these methods when applied to LLM-based TTS is that large language models may have limited access to comprehensive audio information. To address this limitation, we propose DistilCodec and UniTTS, which collectively offer the following advantages: 1) This method can distill a multi-codebook audio codec into a single-codebook audio codec with 32,768 codes while achieving near-100% utilization. 2) As DistilCodec does not employ a semantic alignment…
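
The abstract reports near-100% utilization of the 32,768-entry codebook but, in the portion shown here, does not say how utilization is measured. A minimal sketch of one common way to compute it, assuming utilization means the fraction of codebook entries selected at least once over an evaluation set; the function name `codebook_stats` and the added perplexity measure are illustrative and not taken from the paper:

```python
import math
from collections import Counter

CODEBOOK_SIZE = 32_768  # codebook size stated in the abstract


def codebook_stats(token_ids, codebook_size=CODEBOOK_SIZE):
    """Return (utilization, perplexity) for a stream of codec token ids.

    utilization: fraction of codebook entries selected at least once
                 (an assumed definition, not confirmed by the paper).
    perplexity:  exp of the entropy of the empirical code distribution;
                 it equals codebook_size when usage is perfectly uniform.
    """
    counts = Counter(token_ids)
    if not counts:
        raise ValueError("token_ids is empty")
    if min(counts) < 0 or max(counts) >= codebook_size:
        raise ValueError("token id outside [0, codebook_size)")

    total = sum(counts.values())
    utilization = len(counts) / codebook_size
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return utilization, math.exp(entropy)
```

Calling `codebook_stats(ids)` on the token ids produced by encoding a held-out audio set would give both numbers. Utilization alone can look high even when a few codes dominate, which is why the sketch also returns perplexity as a complementary check.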
