ContextSpeech: Expressive and Efficient Text-to-Speech for Paragraph Reading

Yujia Xiao; Shaofei Zhang; Xi Wang; Xu Tan; Lei He; Sheng Zhao; Frank K. Soong; Tan Lee

arXiv:2307.00782 · cs.CL · October 10, 2023 · 1 citation

Open Access

TL;DR

ContextSpeech is a lightweight TTS system for paragraph reading. It feeds cross-sentence text and speech context into sentence encoding to make long-form speech more natural and expressive, and it uses linearized self-attention to keep computation and memory costs competitive.

Contribution

It introduces a memory-cached recurrence mechanism and hierarchical textual semantics to enhance global context understanding in TTS systems.
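The idea of a memory cache carried from one sentence to the next can be illustrated with a minimal sketch. This is not the authors' implementation: the single attention layer, the function names (`attend`, `encode_paragraph`), and the `mem_len` parameter are all assumptions made for illustration, loosely in the spirit of segment-level recurrence. Each sentence is encoded with queries from the current sentence only, while keys and values also include hidden states cached from earlier sentences.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention over lists of equal-width vectors."""
    dim = len(queries[0])
    width = len(values[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values)) for j in range(width)])
    return out


def encode_paragraph(sentences, mem_len=4):
    """Encode sentences in order; each one can attend to cached states.

    `sentences` is a list of sentences, each a list of feature vectors.
    `mem_len` (hypothetical) caps how many cached states are kept.
    """
    memory = []   # hidden states carried over from earlier sentences
    outputs = []
    for sent in sentences:
        context = memory + sent                  # keys/values: cache + current sentence
        hidden = attend(sent, context, context)  # queries: current sentence only
        outputs.append(hidden)
        # Keep only the most recent states. In a trained model the cache
        # would typically be excluded from gradient computation.
        memory = (memory + hidden)[-mem_len:]
    return outputs
```

With this sketch, the encoding of a sentence depends on what preceded it: `encode_paragraph([s1, s2])[1]` generally differs from `encode_paragraph([s2])[0]`, while the first sentence is encoded exactly as it would be on its own.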

Findings

01

Significantly improves voice quality and prosody in paragraph reading

02

Maintains competitive efficiency with lightweight design

03

Demonstrates effectiveness through extensive experiments

Abstract

While state-of-the-art Text-to-Speech systems can generate natural speech of very high quality at sentence level, they still meet great challenges in speech generation for paragraph / long-form reading. Such deficiencies are due to i) ignorance of cross-sentence contextual information, and ii) high computation and memory cost for long-form synthesis. To address these issues, this work develops a lightweight yet effective TTS system, ContextSpeech. Specifically, we first design a memory-cached recurrence mechanism to incorporate global text and speech context into sentence encoding. Then we construct hierarchically-structured textual semantics to broaden the scope for global context enhancement. Additionally, we integrate linearized self-attention to improve model efficiency. Experiments show that ContextSpeech significantly improves the voice quality and prosody expressiveness in…
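The efficiency argument behind linearized self-attention can be made concrete with a small sketch. The abstract does not say which linearization ContextSpeech uses, so the kernel feature-map variant below (with `elu(x) + 1` as the feature map) is an assumption chosen for illustration, and the names `feature_map` and `linear_attention` are hypothetical. The point it shows is that the key/value summary is accumulated once and then reused for every query, so cost grows linearly with sequence length rather than quadratically.

```python
import math


def feature_map(x):
    """elu(x) + 1: a positive feature map sometimes used for linear attention."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def linear_attention(queries, keys, values):
    """Kernelized attention computed in time linear in the sequence length.

    Instead of forming an n-by-n score matrix, accumulate
    sum_j phi(k_j) v_j^T and sum_j phi(k_j) once, then reuse them per query.
    """
    d_k, d_v = len(keys[0]), len(values[0])
    kv = [[0.0] * d_v for _ in range(d_k)]  # running sum of phi(k) v^T
    k_sum = [0.0] * d_k                     # running sum of phi(k)
    for k, v in zip(keys, values):
        fk = feature_map(k)
        for a in range(d_k):
            k_sum[a] += fk[a]
            for b in range(d_v):
                kv[a][b] += fk[a] * v[b]

    out = []
    for q in queries:
        fq = feature_map(q)
        norm = sum(fq[a] * k_sum[a] for a in range(d_k))  # strictly positive
        out.append(
            [sum(fq[a] * kv[a][b] for a in range(d_k)) / norm for b in range(d_v)]
        )
    return out
```

For a sequence of length n, the accumulation and the per-query readout each take on the order of n · d_k · d_v operations, whereas softmax attention needs on the order of n² · d. That gap is what makes this family of methods attractive for paragraph-length inputs.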

Taxonomy

Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques