Word-Level Style Control for Expressive, Non-attentive Speech Synthesis
Konstantinos Klapsas, Nikolaos Ellinas, June Sig Sung, Hyoungmin Park, Spyros Raptis

TL;DR
This paper introduces a novel non-attentive speech synthesis model that enables word-level style control and prosody transfer by disentangling style and phonetic information using dual encoders and a prior model.
Contribution
It proposes a new architecture with dual encoders for style and phonetic representations, allowing fine-grained style control at the word level in speech synthesis.
Findings
Model achieves word-level and global style control.
Supports prosody transfer, and can also run without a reference utterance by predicting style tokens with a prior encoder.
Disentangles style from phonetic content effectively.
Abstract
This paper presents an expressive speech synthesis architecture for modeling and controlling the speaking style at a word level. It attempts to learn word-level stylistic and prosodic representations of the speech data, with the aid of two encoders. The first one models style by finding a combination of style tokens for each word given the acoustic features, and the second outputs a word-level sequence conditioned only on the phonetic information in order to disentangle it from the style information. The two encoder outputs are aligned and concatenated with the phoneme encoder outputs and then decoded with a Non-Attentive Tacotron model. An extra prior encoder is used to predict the style tokens autoregressively, in order for the model to be able to run without a reference utterance. We find that the resulting model gives both word-level and global control over the style, as well as…
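The data flow described in the abstract can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: the scaled dot-product weighting over style tokens, the random stand-in tensors, and all names (`word_style_embeddings`, `align_to_phonemes`, `phonemes_per_word`) are assumptions made for the example. It shows one word-level style vector being formed as a weighted combination of style tokens, then both word-level sequences being repeated to phoneme resolution and concatenated with the phoneme encoder outputs.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def word_style_embeddings(word_queries, style_tokens):
    """Combine style tokens per word (assumed attention-style weighting).

    word_queries: (num_words, dim), e.g. a per-word summary of acoustic features.
    style_tokens: (num_tokens, dim) learned token bank.
    Returns the (num_words, dim) style vectors and the (num_words, num_tokens) weights.
    """
    scores = word_queries @ style_tokens.T / np.sqrt(style_tokens.shape[1])
    weights = softmax(scores, axis=-1)
    return weights @ style_tokens, weights

def align_to_phonemes(word_level, phonemes_per_word):
    # Repeat each word-level vector once per phoneme in that word.
    return np.repeat(word_level, phonemes_per_word, axis=0)

rng = np.random.default_rng(0)
num_words, num_tokens, dim = 3, 4, 8
phonemes_per_word = np.array([2, 3, 1])            # hypothetical word lengths
num_phonemes = int(phonemes_per_word.sum())

style_tokens = rng.normal(size=(num_tokens, dim))
word_queries = rng.normal(size=(num_words, dim))   # stand-in for the style encoder's per-word input
word_phonetic = rng.normal(size=(num_words, dim))  # stand-in for the phonetic-only word encoder output
phoneme_enc = rng.normal(size=(num_phonemes, dim)) # stand-in for phoneme encoder outputs

style, weights = word_style_embeddings(word_queries, style_tokens)
decoder_input = np.concatenate(
    [
        phoneme_enc,
        align_to_phonemes(style, phonemes_per_word),
        align_to_phonemes(word_phonetic, phonemes_per_word),
    ],
    axis=-1,
)
print(decoder_input.shape)
```

In this sketch, word-level control would amount to editing a row of `weights` (or replacing a row of `style`) before alignment; at inference without a reference utterance, the abstract's prior encoder would supply the style tokens instead of `word_queries`.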
Taxonomy
Methods: Sigmoid Activation · Highway Layer · Highway Network · Bidirectional GRU · Max Pooling · Batch Normalization · Convolution · CBHG · Gated Recurrent Unit · Residual Connection
