Long-Context Speech Synthesis with Context-Aware Memory

Zhipeng Li; Xiaofen Xing; Jingyuan Xing; Hangrui Hu; Heng Lu; Xiangmin Xu

arXiv:2508.14713 · eess.AS · August 21, 2025

TL;DR

This paper introduces a Context-Aware Memory (CAM) mechanism for long-context speech synthesis, improving coherence and naturalness in paragraph-level speech by dynamically integrating long-term memory and local context.

Contribution

The paper presents a novel CAM-based TTS model that effectively captures long-term context and local details, enhancing paragraph-level speech synthesis quality.

Findings

01 Outperforms baseline methods in prosody expressiveness

02 Improves coherence and style consistency in long speech

03 Reduces context inference cost

Abstract

In long-text speech synthesis, current approaches typically convert text to speech at the sentence level and concatenate the results to form pseudo-paragraph-level speech. These methods overlook the contextual coherence of paragraphs, which reduces naturalness and causes inconsistencies in style and timbre across the long-form speech. To address these issues, we propose a Context-Aware Memory (CAM)-based long-context Text-to-Speech (TTS) model. The CAM block integrates and retrieves both long-term memory and local context details, enabling dynamic memory updates and transfers within long paragraphs to guide sentence-level speech synthesis. Furthermore, the prefix mask enhances the in-context learning ability by enabling bidirectional attention on prefix tokens while maintaining unidirectional generation. Experimental results demonstrate that the proposed method outperforms baseline and…
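The prefix mask described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function name `prefix_mask`, its parameters, and the convention that `True` means "may attend" are assumptions made for illustration. It shows only the general pattern the abstract names, in which prefix tokens attend to each other in both directions while the remaining tokens attend causally.

```python
def prefix_mask(prefix_len, total_len):
    """Build a boolean attention mask as a list of rows (hypothetical helper).

    mask[i][j] is True when query position i may attend to key position j.
    The first `prefix_len` positions are treated as the prefix.
    """
    if not 0 <= prefix_len <= total_len:
        raise ValueError("prefix_len must be between 0 and total_len")
    mask = []
    for i in range(total_len):
        row = []
        for j in range(total_len):
            causal = j <= i                                   # see itself and the past
            in_prefix = i < prefix_len and j < prefix_len     # prefix sees the whole prefix
            row.append(causal or in_prefix)
        mask.append(row)
    return mask


# Example: 3 prefix tokens followed by 2 generated tokens.
for row in prefix_mask(prefix_len=3, total_len=5):
    print("".join("1" if allowed else "0" for allowed in row))
# 11100
# 11100
# 11100
# 11110
# 11111
```

In the printed mask, the top-left 3×3 block is fully visible (bidirectional attention within the prefix), while the last two rows remain lower-triangular, so generated tokens never see future positions.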

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and dialogue systems