Flavored Tacotron: Conditional Learning for Prosodic-linguistic Features
Mahsa Elyasi, Gaurav Bharaj

TL;DR
This paper improves neural TTS by conditioning Tacotron-2 on two prosodic features, syllable stress and pitch accent, which yields more natural speech with more accurate prosody and higher quality than the baseline.
Contribution
It introduces a feature conditioning strategy applied at three stages of Tacotron-2 (pre-encoder, post-encoder, and intra-decoder) to model prosodic features for more natural speech synthesis.
Findings
Higher fundamental frequency contour correlation
Lower Mel Cepstral Distortion
Mean Opinion Score of 4.14, surpassing the baseline
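
The two objective metrics listed above can be sketched as follows. This is a minimal illustration using their common textbook definitions, not the paper's evaluation code: it assumes the reference and synthesized sequences are already time-aligned, that coefficient 0 (energy) is skipped for Mel Cepstral Distortion, and that unvoiced frames are marked with an F0 of 0. The function names are hypothetical.

```python
import math


def mel_cepstral_distortion(ref_frames, syn_frames):
    """Mean MCD in dB over time-aligned frames, skipping coefficient 0."""
    if len(ref_frames) != len(syn_frames) or not ref_frames:
        raise ValueError("expected two non-empty, time-aligned sequences")
    const = 10.0 / math.log(10.0) * math.sqrt(2.0)
    total = 0.0
    for ref, syn in zip(ref_frames, syn_frames):
        # Squared Euclidean distance over cepstral coefficients 1..D.
        sq = sum((r - s) ** 2 for r, s in zip(ref[1:], syn[1:]))
        total += const * math.sqrt(sq)
    return total / len(ref_frames)


def f0_correlation(ref_f0, syn_f0):
    """Pearson correlation over frames that are voiced in both contours."""
    # Assumption: an F0 value of 0 marks an unvoiced frame.
    pairs = [(r, s) for r, s in zip(ref_f0, syn_f0) if r > 0 and s > 0]
    if len(pairs) < 2:
        raise ValueError("need at least two jointly voiced frames")
    mean_r = sum(r for r, _ in pairs) / len(pairs)
    mean_s = sum(s for _, s in pairs) / len(pairs)
    cov = sum((r - mean_r) * (s - mean_s) for r, s in pairs)
    var_r = sum((r - mean_r) ** 2 for r, _ in pairs)
    var_s = sum((s - mean_s) ** 2 for _, s in pairs)
    if var_r == 0 or var_s == 0:
        raise ValueError("correlation is undefined for a constant contour")
    return cov / math.sqrt(var_r * var_s)
```

Lower MCD means the synthesized spectrum is closer to the reference, and an F0 correlation closer to 1 means the synthesized pitch contour tracks the reference more closely.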
Abstract
Neural sequence-to-sequence text-to-speech synthesis (TTS), such as Tacotron-2, transforms text into high-quality speech. However, generating speech with natural prosody remains a challenge. Yasuda et al. show that Tacotron-2's encoder doesn't fully represent prosodic features (e.g., syllable stress in English) from characters, resulting in flatter fundamental frequency variations than in natural speech. In this work, we propose a novel, carefully designed strategy for conditioning Tacotron-2 on two fundamental prosodic features in English, syllable stress and pitch accent, that help achieve more natural prosody. To this end, we use a classifier to learn these features in an end-to-end fashion, and apply feature conditioning at three parts of Tacotron-2's Text-To-Mel Spectrogram module: pre-encoder, post-encoder, and intra-decoder. Further, we show that jointly conditioned features…
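
To make the three conditioning sites concrete, here is a toy data-flow sketch. It is not the authors' implementation: the abstract does not say how the features are combined with the network's activations, so concatenation is an assumption, and `toy_encoder`, the attention weights, and all shapes are hypothetical placeholders. The sketch also takes the prosodic labels as given rather than predicting them with a classifier.

```python
def concat(seq, feats):
    """Append each position's feature vector to the vector at that position."""
    if len(seq) != len(feats):
        raise ValueError("sequences must have the same length")
    return [list(v) + list(f) for v, f in zip(seq, feats)]


def weighted_sum(weights, seq):
    """Attention-style weighted sum of the vectors in seq."""
    dim = len(seq[0])
    return [sum(w * v[i] for w, v in zip(weights, seq)) for i in range(dim)]


def toy_encoder(inputs):
    """Placeholder for the Tacotron-2 encoder: returns a copy of its input."""
    return [list(v) for v in inputs]


# Toy inputs: 3 characters with 2-dim embeddings (hypothetical values).
char_emb = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
# Per-character prosodic labels: [stress, pitch_accent], assumed binary.
prosody = [[0, 0], [1, 1], [0, 0]]

# 1) Pre-encoder: features join the character embeddings before encoding.
enc_in = concat(char_emb, prosody)

# 2) Post-encoder: features join the encoder outputs that attention reads.
enc_out = toy_encoder(char_emb)
memory = concat(enc_out, prosody)

# 3) Intra-decoder: features join the decoder input at each decoding step,
#    here summarized with the same (hypothetical) attention weights.
attn = [0.0, 1.0, 0.0]
prev_frame = [0.0]
dec_in = prev_frame + weighted_sum(attn, enc_out) + weighted_sum(attn, prosody)
```

Each site is shown independently here; the three sites differ only in where the prosodic vectors enter the network.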
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
