Sequence to Sequence Neural Speech Synthesis with Prosody Modification Capabilities
Slava Shechtman, Alex Sorin

TL;DR
This paper introduces an automatic prosody control method for neural TTS systems, enabling sentence-wise pace and expressiveness adjustments, which enhances speech naturalness and allows for better prosody manipulation without labeled data.
Contribution
The work presents a novel automatic prosody control approach and an augmented attention mechanism to improve pace control sensitivity and speech expressiveness in neural TTS.
Findings
Subjective evaluations show improved speech expressiveness.
The proposed method allows for continuous prosody adjustments.
The augmented attention mechanism leads to faster convergence.
Abstract
Modern sequence-to-sequence neural TTS systems provide close to natural speech quality. Such systems usually comprise a network that converts a sequence of linguistic/phonetic features to a sequence of acoustic features, cascaded with a neural vocoder. The generated speech prosody (i.e., phoneme durations, pitch and loudness) is implicitly present in the acoustic features, mixed with spectral information. Although the speech sounds natural, its prosody realization is randomly chosen and cannot be easily altered. Prosody control becomes even more difficult if no prosodic labeling is present in the training data. Recently, much progress has been achieved in unsupervised speaking style learning and generation; however, human inspection is still required after training to discover and interpret the speaking styles learned by the system. In this work we introduce a fully…
