Transplantation of Conversational Speaking Style with Interjections in Sequence-to-Sequence Speech Synthesis

Raul Fernandez; David Haws; Guy Lorberbom; Slava Shechtman; Alexander Sorin

arXiv:2207.12262·eess.AS·July 26, 2022

TL;DR

This paper investigates style transfer in sequence-to-sequence speech synthesis, focusing on a conversational style with interjections. It uses a dedicated single-speaker corpus and voice-conversion-based data augmentation, and reports high-fidelity style transfer accompanied by some shift in voice persona.

Contribution

It introduces a method for one-to-many style transfer in speech synthesis using a specialized corpus and voice conversion, demonstrating effective style transfer without quality loss.

Findings

01 High-fidelity style transfer achieved

02 Voice persona shift observed

03 Voice conversion needs further improvement

Abstract

Sequence-to-Sequence Text-to-Speech architectures that directly generate low-level acoustic features from phonetic sequences are known to produce natural and expressive speech when provided with adequate amounts of training data. Such systems can learn and transfer desired speaking styles from one seen speaker to another (in multi-style multi-speaker settings), which is highly desirable for creating scalable and customizable Human-Computer Interaction systems. In this work we explore one-to-many style transfer from a dedicated single-speaker conversational corpus with style nuances and interjections. We elaborate on the corpus design and explore the feasibility of such style transfer when assisted with Voice-Conversion-based data augmentation. In a set of subjective listening experiments, this approach resulted in high-fidelity style transfer with no quality degradation. However, a…

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and dialogue systems · Speech and Audio Processing