Controllable Emphasis with zero data for text-to-speech

Arnaud Joly; Marco Nicolis; Ekaterina Peterova; Alessandro Lombardi; Ammar Abbas; Arent van Korlaar; Aman Hussain; Parul Sharma; Alexis Moinet; Mateusz Lajszczak; Penny Karanasou; Antonio Bonafonte; Thomas Drugman; Elena Sokolova

arXiv:2307.07062·eess.AS·July 17, 2023

TL;DR

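This paper introduces a scalable, annotation-free method for emphasizing words in text-to-speech synthesis by lengthening the predicted phoneme durations of the emphasized word. Compared with spectrogram modification, it improves both naturalness and listeners' identification of the emphasized word, and it was preferred in all four languages tested.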

Contribution

The authors propose a simple duration-based emphasis technique that does not require recordings or annotations, outperforming spectrogram modification methods.

Findings

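01 Improves naturalness by 7.3% over spectrogram modification techniques on a reference female en-US voice

02 Increases testers' correct identification of the emphasized word by 40% on the same voice

03 Preferred in all four languages tested (English, Spanish, Italian, German), across different voices and multiple speaking styles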

Abstract

We present a scalable method to produce high-quality emphasis for text-to-speech (TTS) that does not require recordings or annotations. Many TTS models include a phoneme duration model. A simple but effective method to achieve emphasized speech is to increase the predicted duration of the emphasized word. We show that this is significantly better than spectrogram modification techniques, improving naturalness by 7.3% and testers' correct identification of the emphasized word in a sentence by 40% on a reference female en-US voice. We show that this technique significantly closes the gap to methods that require explicit recordings. The method proved to be scalable and was preferred in all four languages tested (English, Spanish, Italian, German), for different voices and multiple speaking styles.
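The core idea in the abstract, lengthening the predicted duration of the emphasized word, can be illustrated with a minimal sketch. It is not the authors' implementation. The function name `emphasize_durations`, the per-phoneme `word_ids` alignment, and the default `scale` of 1.5 are all assumptions made for illustration; the abstract does not state the scaling factor or how durations are represented.

```python
from typing import List, Sequence


def emphasize_durations(
    durations: Sequence[float],
    word_ids: Sequence[int],
    emphasized_word: int,
    scale: float = 1.5,  # hypothetical factor, not taken from the paper
) -> List[float]:
    """Scale the predicted durations of every phoneme in one word.

    durations:       per-phoneme durations from a duration model
                     (unit is arbitrary here, e.g. frames or seconds)
    word_ids:        index of the word each phoneme belongs to
    emphasized_word: index of the word to emphasize
    """
    if len(durations) != len(word_ids):
        raise ValueError("durations and word_ids must have the same length")
    if scale <= 0:
        raise ValueError("scale must be positive")
    # Phonemes of the target word are lengthened; all others are unchanged.
    return [
        d * scale if w == emphasized_word else d
        for d, w in zip(durations, word_ids)
    ]
```

For example, with durations `[5.0, 7.0, 6.0, 4.0]` and word indices `[0, 0, 1, 2]`, emphasizing word 1 changes only the third duration. In a real system the modified durations would then be passed to the acoustic model in place of the original predictions, and frame-based durations would likely need rounding to integers.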

Taxonomy

Topics: Speech Recognition and Synthesis · Phonetics and Phonology Research · Speech and Audio Processing