Zero-Shot Text-to-Speech from Continuous Text Streams

Trung Dang; David Aponte; Dung Tran; Tianyi Chen; Kazuhito Koishida

arXiv:2410.00767·cs.SD·October 2, 2024

Open Access 3 Reviews

TL;DR

This paper introduces LiveSpeech 2, a stream-aware zero-shot TTS model that generates continuous, synchronized speech from ongoing text streams with seamless chunk transitions, and reports results competitive with state-of-the-art zero-shot TTS models.

Contribution

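The paper presents LiveSpeech 2, a stream-aware zero-shot TTS model that supports infinitely long text streams, text-audio stream synchronization, and smooth transitions between short speech chunks. It builds on Mamba sequence modeling augmented with cross-attention for conditioning, and uses rotary positional embeddings in the cross-attention computation.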

Findings

1. Competitive with state-of-the-art zero-shot TTS models
2. Supports infinitely long speech generation from continuous text streams
3. Enables seamless transitions between speech chunks

Abstract

Existing zero-shot text-to-speech (TTS) systems are typically designed to process complete sentences and are constrained by the maximum duration for which they have been trained. However, in many streaming applications, texts arrive continuously in short chunks, necessitating instant responses from the system. We identify the essential capabilities required for chunk-level streaming and introduce LiveSpeech 2, a stream-aware model that supports infinitely long speech generation, text-audio stream synchronization, and seamless transitions between short speech chunks. To achieve these, we propose (1) adopting Mamba, a class of sequence modeling distinguished by linear-time decoding, which is augmented by cross-attention mechanisms for conditioning, (2) utilizing rotary positional embeddings in the computation of cross-attention, enabling the model to process an infinite text stream by…
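The abstract is cut off before it explains how rotary positional embeddings are used in cross-attention, so the following is only a minimal sketch of the general property that makes such embeddings attractive for unbounded streams: the attention score between a rotated query and a rotated key depends on their relative offset, not on their absolute positions. The function names `rotate` and `score`, the base of 10000, and the toy vectors are illustrative assumptions and are not taken from the paper.

```python
import math


def rotate(vec, position, base=10000.0):
    """Apply a rotary positional embedding to an even-length vector.

    Adjacent pairs (vec[i], vec[i + 1]) are rotated by an angle that grows
    linearly with `position`. The base value is an assumed default.
    """
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def score(query, q_pos, key, k_pos):
    """Dot-product attention score between a rotated query and a rotated key."""
    q = rotate(query, q_pos)
    k = rotate(key, k_pos)
    return sum(a * b for a, b in zip(q, k))


# Toy vectors (hypothetical values, for illustration only).
query = [0.3, -1.2, 0.7, 0.5]
key = [1.0, 0.4, -0.6, 0.9]

# Same relative offset (3) early in the stream and very late in the stream.
near = score(query, 5, key, 2)
far = score(query, 100005, key, 100002)

# The two scores agree up to floating-point error.
print(abs(near - far) < 1e-6)
```

Because only the offset matters, a model could in principle keep attending over a bounded window of recent text while absolute positions grow without limit. Whether LiveSpeech 2 does exactly this is not stated in the visible part of the abstract.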

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 3 · Confidence 3

Strengths

The paper introduces L3Speech, a Mamba-based TTS system designed for continuous text input in short chunks, essential for streaming applications.

1. This paper explores the use of Mamba as the backbone for AR-based TTS.
2. It proposes a decoding method assisted by semantic guidance to enhance model stability.
3. Various tricks, such as improved rotary positional embeddings, are designed to improve the model's performance in streaming inference, including streaming text input.

Weaknesses

The paper introduces L3Speech, a Mamba-based TTS system designed for continuous text input in short chunks, essential for streaming applications. It proposes techniques such as rotary positional embeddings for infinite text stream processing, semantic guidance, and optimized text chunk length and quantity selection. However, it has notable weaknesses, including:

1. Although this paper claims to support infinite text input and streaming inference, it only evaluates on the LibriTTS dataset. I thi

Reviewer 02 · Rating 6 · Confidence 4

Strengths

- The paper is well-written and easy to follow.
- This paper addresses an emerging and impactful area: streaming TTS with integration into large language models. By targeting the latency challenges and supporting flexible, continuous input, the model shows promise for real-time applications.
- The proposed architecture is well-grounded, utilizing the Mamba-based decoder, rotary positional embeddings, and semantic guidance via grapheme tokens. These design choices are technically solid, and the

Weaknesses

- Missing reference section.
- Lack of Real-Time Runtime Analysis: A runtime analysis comparing the proposed method with baselines, especially in scenarios where L3Speech is cascaded with a language model, would provide a clearer picture of the latency benefits. Including this analysis would strengthen the paper by illustrating the real-world efficiency gains.

Reviewer 03 · Rating 6 · Confidence 3

Strengths

- A streaming TTS approach is provided with good results.
- Semantic guidance is used to alleviate speech-text misalignment.

Weaknesses

- Related work is not complete. Many streaming TTS models have been proposed, such as Transducer-based TTS [1,2], which is time-synchronous and naturally fits streaming applications.
- Reference section is missing.
- Motivation for some method choices is not discussed, for example, positional indices based on arrival time.
- Missing the study of the importance of acoustic model choice during soft guidance. It could be critical for languages with low resource and inf

Taxonomy

Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis · Speech and dialogue systems