Quantize More, Lose Less: Autoregressive Generation from Residually Quantized Speech Representations

Yichen Han; Xiaoyang Hao; Keming Chen; Weibo Xiong; Jun He; Ruonan Zhang; Junjie Cao; Yue Liu; Bowen Li; Dongrui Zhang; Hui Xia; Huilei Fu; Kai Jia; Kaixuan Guo; Mingli Jin; Qingyun Meng; Ruidong Ma; Ruiqian Fang; Shaotong Guo; Xuhui Li; Yang Xiang; Ying Zhang; Yulong Liu; Yunfeng Li; Yuyi Zhang; Yuze Zhou; Zhen Wang; Zhaowen Chen

arXiv:2507.12197 · cs.SD · July 17, 2025

Open Access

TL;DR

This paper introduces QTTS, a text-to-speech framework that pairs a new residual-quantization codec (QDAC) with two autoregressive modeling strategies, targeting near-lossless, high-fidelity synthesis that preserves fine-grained detail; its Delay Multihead approach also speeds up inference.

Contribution

The paper presents QDAC, an end-to-end trained audio codec with superior semantic disentanglement, and two innovative autoregressive strategies for high-quality, scalable speech synthesis.

Findings

01. Achieves higher synthesis quality than baseline methods.

02. Better preservation of prosodic and speaker-specific features.

03. Faster inference with the Delay Multihead approach.

Abstract

Text-to-speech (TTS) synthesis has seen renewed progress under the discrete modeling paradigm. Existing autoregressive approaches often rely on single-codebook representations, which suffer from significant information loss. Even with post-hoc refinement techniques such as flow matching, these methods fail to recover fine-grained details (e.g., prosodic nuances, speaker-specific timbres), especially in challenging scenarios like singing voice or music synthesis. We propose QTTS, a novel TTS framework built upon our new audio codec, QDAC. The core innovation of QDAC lies in its end-to-end training of an ASR-based auto-regressive network with a GAN, which achieves superior semantic feature disentanglement for scalable, near-lossless compression. QTTS models these discrete codes using two innovative strategies: the Hierarchical Parallel architecture, which uses a dual-AR structure to model…
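The abstract's contrast between single-codebook and residually quantized representations can be illustrated with a toy sketch of generic residual vector quantization. This is not QDAC: the codebooks here are random rather than learned, and every name (`make_codebook`, `rvq_encode`, `rvq_decode`, the sizes and scales) is a hypothetical choice made for illustration. Each stage quantizes what the previous stages left over, so reconstruction error can only stay level or shrink as more codebooks are used.

```python
import random


def make_codebook(rng, size, dim, scale):
    """Random (not learned) codebook; entry 0 is the zero vector,
    so a stage can always choose to leave the residual unchanged."""
    book = [[0.0] * dim]
    book += [[rng.gauss(0.0, scale) for _ in range(dim)] for _ in range(size - 1)]
    return book


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def rvq_encode(vec, codebooks):
    """Return one index per codebook; each stage quantizes the previous residual."""
    residual = list(vec)
    codes = []
    for book in codebooks:
        idx = min(range(len(book)), key=lambda i: sq_dist(residual, book[i]))
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, book[idx])]
    return codes


def rvq_decode(codes, codebooks, dim):
    """Sum the selected codewords; with zero codes this returns the zero vector."""
    out = [0.0] * dim
    for idx, book in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, book[idx])]
    return out


# Assumed toy settings: 4-dimensional frames, 64 entries per codebook, 4 stages,
# with later codebooks drawn at smaller scales to refine smaller residuals.
rng = random.Random(0)
dim, size, stages = 4, 64, 4
codebooks = [make_codebook(rng, size, dim, scale=0.5 ** s) for s in range(stages)]

frame = [rng.gauss(0.0, 1.0) for _ in range(dim)]
codes = rvq_encode(frame, codebooks)

# errors[k] is the squared reconstruction error using only the first k codebooks;
# errors[1] corresponds to a single-codebook representation.
errors = [
    sq_dist(frame, rvq_decode(codes[:k], codebooks[:k], dim))
    for k in range(stages + 1)
]
for k, err in enumerate(errors):
    print(f"codebooks used: {k}  squared error: {err:.4f}")
```

Because encoding is greedy, the first k indices are exactly what a k-codebook quantizer would emit, which is why truncating `codes` gives the coarser reconstructions. An autoregressive model over such codes has to predict several indices per frame instead of one, which is the modeling problem the two QTTS strategies address.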

Taxonomy

Topics: Speech Recognition and Synthesis