TL;DR
UniTTS introduces an end-to-end TTS system that integrates comprehensive audio modeling without decoupling acoustic and semantic information, enabling flexible prompt-based speech synthesis and broad data utilization.
Contribution
It presents DistilCodec for single-codebook audio encoding and integrates multiple autoregressive tasks into UniTTS, enhancing TTS capabilities with diverse data and prompt handling.
Findings
Achieves near-100% utilization of a 32,768-code single-codebook audio codec.
Enables incorporation of unlabeled high-quality audio data during training.
Supports interleaved text and speech prompts with preserved language model capabilities.
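To make the first finding concrete, codebook utilization is commonly measured as the fraction of codebook entries that appear at least once in the codec's output over some evaluation set. The following is a minimal sketch under that assumption; the summary does not state how the authors compute the figure. The function name `codebook_utilization` is hypothetical, and the token stream is random toy data, not real DistilCodec output.

```python
import random

CODEBOOK_SIZE = 32_768  # codebook size reported in the findings above


def codebook_utilization(token_ids, codebook_size=CODEBOOK_SIZE):
    """Return the fraction of codebook entries that appear at least once."""
    used = set()
    for t in token_ids:
        if not 0 <= t < codebook_size:
            raise ValueError(f"token id {t} is outside the codebook range")
        used.add(t)
    return len(used) / codebook_size


# Toy stand-in for codec output: uniformly random ids, NOT real DistilCodec tokens.
rng = random.Random(0)
tokens = [rng.randrange(CODEBOOK_SIZE) for _ in range(500_000)]
utilization = codebook_utilization(tokens)
print(f"utilization: {utilization:.4f}")
```

A codec that suffers from codebook collapse would score well below 1.0 on this measure, because many entries are never selected; a value near 1.0 means nearly every code is in use.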
Abstract
The emergence of multi-codebook neural audio codecs such as Residual Vector Quantization (RVQ) and Group Vector Quantization (GVQ) has significantly advanced Large Language Model (LLM)-based Text-to-Speech (TTS) systems. These codecs are crucial in separating semantic and acoustic information while efficiently harnessing semantic priors. However, since semantic and acoustic information cannot be fully aligned, a significant drawback of these methods when applied to LLM-based TTS is that large language models may have limited access to comprehensive audio information. To address this limitation, we propose DistilCodec and UniTTS, which collectively offer the following advantages: 1) This method can distill a multi-codebook audio codec into a single-codebook audio codec with 32,768 codes while achieving near 100% utilization. 2) As DistilCodec does not employ a semantic alignment…
