Supervised and Unsupervised Approaches for Controlling Narrow Lexical Focus in Sequence-to-Sequence Speech Synthesis

Slava Shechtman; Raul Fernandez; David Haws

arXiv:2101.09940·eess.AS·April 6, 2021·SLT

TL;DR

This paper introduces a framework for controlling prosodic features, specifically lexical focus, in sequence-to-sequence speech synthesis using interpretable parameters, improving controllability without sacrificing naturalness.

Contribution

The work proposes several architectures for controlling emphatic lexical focus in speech synthesis, each designed for a different level of supervision depending on the availability of labeled resources, and shows in listening tests that they yield controllable, natural-sounding output.

Findings

01. Successful control of lexical focus demonstrated in listening tests

02. Controllable synthesis maintains or exceeds baseline naturalness

03. Different supervision levels impact focus control effectiveness

Abstract

Although Sequence-to-Sequence (S2S) architectures have become state-of-the-art in speech synthesis, capable of generating outputs that approach the perceptual quality of natural samples, they are limited by a lack of flexibility when it comes to controlling the output. In this work we present a framework capable of controlling the prosodic output via a set of concise, interpretable, disentangled parameters. We apply this framework to the realization of emphatic lexical focus, proposing a variety of architectures designed to exploit different levels of supervision based on the availability of labeled resources. We evaluate these approaches via listening tests that demonstrate we are able to successfully realize controllable focus while maintaining the same, or higher, naturalness over an established baseline, and we explore how the different approaches compare when synthesizing in a…
