Long-Context Speech Synthesis with Context-Aware Memory
Zhipeng Li, Xiaofen Xing, Jingyuan Xing, Hangrui Hu, Heng Lu, Xiangmin Xu

TL;DR
This paper introduces a Context-Aware Memory (CAM) mechanism for long-context speech synthesis, improving coherence and naturalness in paragraph-level speech by dynamically integrating long-term memory and local context.
Contribution
The paper presents a novel CAM-based TTS model that effectively captures long-term context and local details, enhancing paragraph-level speech synthesis quality.
Findings
Outperforms baseline methods in prosody expressiveness
Improves coherence and style consistency in long speech
Reduces context inference cost
Abstract
In long-text speech synthesis, current approaches typically convert text to speech at the sentence level and concatenate the results to form pseudo-paragraph-level speech. These methods overlook the contextual coherence of paragraphs, which reduces naturalness and causes inconsistencies in style and timbre across long-form speech. To address these issues, we propose a Context-Aware Memory (CAM)-based long-context Text-to-Speech (TTS) model. The CAM block integrates and retrieves both long-term memory and local context details, enabling dynamic memory updates and transfers within long paragraphs to guide sentence-level speech synthesis. Furthermore, the prefix mask enhances in-context learning by enabling bidirectional attention on prefix tokens while maintaining unidirectional generation. Experimental results demonstrate that the proposed method outperforms baseline and…
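The prefix mask mentioned in the abstract can be illustrated with a minimal sketch. This is a generic prefix-style attention mask written in plain Python, not the authors' implementation. The function name `prefix_mask` and its parameters `prefix_len` and `total_len` are hypothetical, chosen only for illustration.

```python
def prefix_mask(prefix_len: int, total_len: int) -> list[list[bool]]:
    """Build an attention mask where mask[i][j] is True when
    position i is allowed to attend to position j.

    Assumption (not from the paper): the first `prefix_len` positions
    are prefix tokens and the remaining positions are generated tokens.
    """
    if not 0 <= prefix_len <= total_len:
        raise ValueError("prefix_len must be between 0 and total_len")
    mask = []
    for i in range(total_len):
        row = []
        for j in range(total_len):
            in_prefix = j < prefix_len  # every position may see the whole prefix
            causal = j <= i             # otherwise only the current and earlier positions
            row.append(in_prefix or causal)
        mask.append(row)
    return mask


# Two prefix tokens followed by two generated tokens.
for row in prefix_mask(2, 4):
    print("".join("1" if allowed else "0" for allowed in row))
# 1100
# 1100
# 1110
# 1111
```

The first two rows show the prefix tokens attending to each other in both directions, while the last two rows stay lower-triangular, so generation remains unidirectional.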
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and dialogue systems
