Prosodic Boundary-Aware Streaming Generation for LLM-Based TTS with Streaming Text Input
Changsong Liu, Tianrui Wang, Ye Ni, Yizhou Peng, Eng Siong Chng

TL;DR
This paper introduces a prosodic-boundary-aware streaming TTS method. The model is post-trained to stop generation at content boundaries when given only limited future text, which improves naturalness and long-form stability and lets streamed segments be concatenated seamlessly.
Contribution
It proposes a post-training adaptation strategy for pretrained LLM-based TTS models, trained on weakly time-aligned data, that enables streaming synthesis with bounded context and improved prosody.
Findings
Achieves a 66.2-percentage-point absolute reduction in word error rate for long-text synthesis (from 71.0% to 4.8%).
Outperforms the CosyVoice-style interleaved baseline in speaker and emotion similarity metrics.
Ensures seamless concatenation in streaming TTS scenarios.
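The word error rate figure is an absolute difference, which is easy to misread as a relative one. The short sketch below works through both from the two numbers given in the abstract (71.0% and 4.8%); the function name `wer_reduction` is made up for this illustration and does not come from the paper.

```python
def wer_reduction(baseline_wer: float, new_wer: float) -> tuple[float, float]:
    """Return (absolute reduction in percentage points, relative reduction in percent)."""
    absolute = baseline_wer - new_wer
    relative = absolute / baseline_wer * 100.0
    return round(absolute, 1), round(relative, 1)


# Figures quoted in the abstract: baseline 71.0% WER, proposed method 4.8% WER.
absolute_points, relative_percent = wer_reduction(71.0, 4.8)
print(absolute_points, relative_percent)  # 66.2 93.2
```

Read this way, the reported 66.2 is a drop in percentage points, and the corresponding relative reduction is roughly 93%.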
Abstract
Streaming TTS that receives streaming text is essential for interactive systems, yet this scheme faces two major challenges: unnatural prosody due to missing lookahead and long-form collapse due to unbounded context. We propose a prosodic-boundary-aware post-training strategy that adapts a pretrained LLM-based TTS model using weakly time-aligned data. Specifically, the model is adapted to learn early stopping at specified content boundaries when provided with limited future text. During inference, a sliding-window prompt carries forward previous text and speech tokens, ensuring bounded context and seamless concatenation. Evaluations show that our method outperforms the CosyVoice-style interleaved baseline in both short and long-form scenarios. In long-text synthesis in particular, it achieves a 66.2% absolute reduction in word error rate (from 71.0% to 4.8%) and increases speaker and emotion…
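The inference loop described in the abstract (limited lookahead, stopping at the chunk boundary, and a sliding-window prompt of earlier text and speech tokens) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the `synth` callable, its argument order, the chunking of the text, the window size, and the `toy_synth` stand-in are all hypothetical.

```python
from collections import deque
from typing import Callable, Deque, Iterable, Iterator, List, Tuple

# Hypothetical model interface (an assumption, not from the paper):
# (prompt_text, prompt_speech, current_text, lookahead_text) -> speech tokens
# covering current_text only, i.e. generation stops at the chunk boundary.
SynthFn = Callable[[List[str], List[int], List[str], List[str]], List[int]]


def stream_tts(
    text_chunks: Iterable[List[str]], synth: SynthFn, window: int = 2
) -> Iterator[List[int]]:
    """Yield speech tokens chunk by chunk, prompting with at most `window` past chunks."""
    # deque(maxlen=...) drops the oldest chunk automatically, so the prompt stays bounded.
    history: Deque[Tuple[List[str], List[int]]] = deque(maxlen=window)
    it = iter(text_chunks)
    current = next(it, None)
    while current is not None:
        # Wait for one more chunk so the model sees limited future text.
        lookahead = next(it, None)
        prompt_text = [word for text, _ in history for word in text]
        prompt_speech = [tok for _, speech in history for tok in speech]
        speech = synth(prompt_text, prompt_speech, current, lookahead or [])
        # Carry the finished chunk forward as prompt material for later chunks.
        history.append((current, speech))
        yield speech
        current = lookahead


def toy_synth(prompt_text, prompt_speech, current, lookahead):
    # Stand-in for a real model: one fake token per word of the current chunk
    # and nothing for the lookahead, mimicking an early stop at the boundary.
    return [len(word) for word in current]
```

Concatenating the yielded lists gives the full token stream, and because the history holds at most `window` chunks, the prompt length does not grow with the length of the input text.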
Taxonomy
Topics: Speech and dialogue systems · Speech Recognition and Synthesis · Natural Language Processing Techniques
