LiveSpeech: Low-Latency Zero-shot Text-to-Speech via Autoregressive Modeling of Audio Discrete Codes

Trung Dang; David Aponte; Dung Tran; Kazuhito Koishida

arXiv:2406.02897 · cs.SD · June 11, 2024

TL;DR

LiveSpeech is a low-latency, fully autoregressive zero-shot text-to-speech model that streams high-quality speech in real time by predicting multiple audio tokens in a single decoding step.

Contribution

The paper proposes a fully autoregressive language model that uses adaptive codebook loss weights and processes groups of codebooks in parallel, enabling low-latency zero-shot TTS with competitive quality.

Findings

01

Achieves real-time streaming with high content accuracy.

02

Maintains speaker similarity and audio quality.

03

Achieves inference speed competitive with state-of-the-art baselines.

Abstract

Prior works have demonstrated zero-shot text-to-speech by using a generative language model on audio tokens obtained via a neural audio codec. It is still challenging, however, to adapt them to low-latency scenarios. In this paper, we present LiveSpeech, a fully autoregressive language model-based approach for zero-shot text-to-speech that enables low-latency streaming of the output audio. To allow multiple token prediction within a single decoding step, we propose (1) using adaptive codebook loss weights that consider codebook contribution in each frame and focus on hard instances, and (2) grouping codebooks and processing groups in parallel. Experiments show our proposed models achieve results competitive with state-of-the-art baselines in terms of content accuracy, speaker similarity, audio quality, and inference speed while being suitable for low-latency streaming applications.
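
The two ideas named in the abstract can be illustrated with a minimal NumPy sketch. It is not the paper's formulation: the abstract does not give the weighting formula, so the focal-style weight `(1 - p_target) ** gamma`, the per-frame normalization, the contiguous grouping, and the names `group_codebooks`, `weighted_codebook_loss`, and `gamma` are all assumptions made for illustration. The sketch covers only the "focus on hard instances" part and leaves out the per-frame codebook contribution term.

```python
import numpy as np


def group_codebooks(num_codebooks, num_groups):
    """Split codebook indices into contiguous groups (hypothetical grouping scheme)."""
    parts = np.array_split(np.arange(num_codebooks), num_groups)
    return [part.tolist() for part in parts]


def weighted_codebook_loss(logits, targets, gamma=1.0):
    """Cross-entropy over codebooks with weights that favor hard targets.

    logits:  float array of shape (frames, codebooks, vocab)
    targets: int array of shape (frames, codebooks)
    Returns the scalar loss and the weights of shape (frames, codebooks).
    """
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

    # Negative log-likelihood of each target token.
    nll = -np.take_along_axis(log_probs, targets[..., None], axis=-1)[..., 0]

    # Assumed focal-style weight: poorly predicted tokens get larger weight.
    p_target = np.exp(-nll)
    raw = (1.0 - p_target) ** gamma + 1e-8

    # Normalize so the codebook weights of each frame sum to 1.
    weights = raw / raw.sum(axis=-1, keepdims=True)

    # Weighted sum over codebooks, then mean over frames.
    loss = (weights * nll).sum(axis=-1).mean()
    return loss, weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    logits = rng.normal(size=(3, 8, 16))        # 3 frames, 8 codebooks, 16 tokens
    targets = rng.integers(0, 16, size=(3, 8))
    loss, weights = weighted_codebook_loss(logits, targets, gamma=2.0)
    print(group_codebooks(8, 4))
    print(float(loss), weights.shape)
```

In a training framework with automatic differentiation, weights of this kind would usually be treated as constants, so that gradients flow only through the cross-entropy term. Each group returned by `group_codebooks` would then be handled by its own parallel prediction branch.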

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Advanced Data Compression Techniques