Emphasis control for parallel neural TTS
Shreyas Seshadri, Tuomo Raitio, Dan Castellani, Jiangchuan Li

TL;DR
This paper introduces a hierarchical neural TTS system that enables explicit control over prosodic emphasis by learning a latent space, improving expressiveness without sacrificing speech quality.
Contribution
It proposes a latent space for emphasis control in parallel neural TTS, compares three candidate features for that space, and demonstrates word-level emphasis manipulation at inference time.
Findings
All proposed methods successfully increased perceived emphasis.
Emphasized speech was preferred over non-emphasized speech in listening tests.
The approach maintains high speech quality with emphasis control.
Abstract
Recent parallel neural text-to-speech (TTS) synthesis methods are able to generate speech with high fidelity while maintaining high performance. However, these systems often lack control over the output prosody, thus restricting the semantic information conveyable for a given text. This paper proposes a hierarchical parallel neural TTS system for prosodic emphasis control by learning a latent space that directly corresponds to a change in emphasis. Three candidate features for the latent space are compared: 1) Variance of pitch and duration within words in a sentence, 2) Wavelet-based feature computed from pitch, energy, and duration, and 3) Learned combination of the two aforementioned approaches. At inference time, word-level prosodic emphasis is achieved by increasing the feature values of the latent space for the given words. Experiments show that all the proposed methods are able…
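As a rough illustration of the first candidate feature and of the inference-time control described in the abstract, the sketch below computes a per-word variance feature and then raises it for selected words. This is not the authors' implementation. The function names, the input layout (one list of pitch values and one list of durations per word), and the additive `boost` are assumptions made for illustration, and in the actual system the modified features would condition the TTS model rather than be used directly.

```python
from statistics import pvariance


def word_variance_features(pitch_per_word, durations_per_word):
    """Return one (pitch_variance, duration_variance) pair per word.

    Assumed input layout (hypothetical): each word contributes a list of
    pitch values and a list of duration values.
    """
    features = []
    for pitch, durations in zip(pitch_per_word, durations_per_word):
        features.append((pvariance(pitch), pvariance(durations)))
    return features


def emphasize(features, word_indices, boost=1.0):
    """Raise the feature values of the selected words.

    The additive offset is an assumption; the abstract only says the
    feature values are increased for the given words.
    """
    targets = set(word_indices)
    return [
        tuple(value + boost for value in feat) if i in targets else feat
        for i, feat in enumerate(features)
    ]


# Two-word toy sentence: pitch in Hz, durations in frames (made-up values).
pitch = [[100.0, 120.0], [110.0, 110.0]]
durations = [[5, 7], [6, 6]]

features = word_variance_features(pitch, durations)
boosted = emphasize(features, word_indices=[1], boost=2.0)
print(features)
print(boosted)
```

Only the second word's features change, which mirrors the word-level granularity of the control described above.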
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and dialogue systems · Natural Language Processing Techniques
Methods: Test
