Emphasis control for parallel neural TTS

Shreyas Seshadri; Tuomo Raitio; Dan Castellani; Jiangchuan Li

arXiv:2110.03012 · eess.AS · March 30, 2022

TL;DR

This paper introduces a hierarchical parallel neural TTS system that gives explicit control over prosodic emphasis by learning a latent space that corresponds directly to a change in emphasis, improving expressiveness without sacrificing speech quality.

Contribution

The paper proposes a latent space for emphasis control in neural TTS, compares three candidate features for that space, and demonstrates word-level emphasis manipulation at inference time.

Findings

01 All proposed methods successfully increased perceived emphasis.

02 Emphasized speech was preferred over non-emphasized speech in listening tests.

03 The approach maintains high speech quality with emphasis control.

Abstract

Recent parallel neural text-to-speech (TTS) synthesis methods are able to generate speech with high fidelity while maintaining high performance. However, these systems often lack control over the output prosody, thus restricting the semantic information conveyable for a given text. This paper proposes a hierarchical parallel neural TTS system for prosodic emphasis control by learning a latent space that directly corresponds to a change in emphasis. Three candidate features for the latent space are compared: 1) Variance of pitch and duration within words in a sentence, 2) Wavelet-based feature computed from pitch, energy, and duration, and 3) Learned combination of the two aforementioned approaches. At inference time, word-level prosodic emphasis is achieved by increasing the feature values of the latent space for the given words. Experiments show that all the proposed methods are able…
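The inference-time idea in the abstract (raise the latent feature value for the words to be emphasized) can be illustrated with a toy sketch. It takes one possible reading of candidate feature 1, the variance of frame-level pitch inside each word, and omits duration. The function names `word_level_variance` and `emphasize`, the toy pitch track, the word boundaries, and the additive `bias` are all hypothetical and are not taken from the paper, which may compute and modify the feature differently.

```python
from statistics import pvariance


def word_level_variance(frame_values, word_boundaries):
    """Population variance of a frame-level track inside each word.

    word_boundaries holds (start, end) frame indices, end exclusive.
    Hypothetical helper: one reading of "variance of pitch within words".
    """
    return [pvariance(frame_values[start:end]) for start, end in word_boundaries]


def emphasize(features, word_index, bias):
    """Return a copy of the word-level features with `bias` added for one word.

    The additive bias is an assumption; the abstract only says the
    feature values are increased for the given words.
    """
    out = list(features)
    out[word_index] += bias
    return out


# Toy pitch track in Hz covering three words (made-up numbers).
pitch = [100.0, 102.0, 98.0, 120.0, 140.0, 110.0, 105.0, 105.0]
boundaries = [(0, 3), (3, 6), (6, 8)]

features = word_level_variance(pitch, boundaries)
emphasized = emphasize(features, word_index=2, bias=1.0)
```

In a real system the modified word-level values would be fed to the synthesis model in place of the predicted ones; here they are only plain Python lists, and the original `features` list is left unchanged.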

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and dialogue systems · Natural Language Processing Techniques

Methods: Test