Multi-Scale Spectrogram Modelling for Neural Text-to-Speech

Ammar Abbas; Bajibabu Bollepalli; Alexis Moinet; Arnaud Joly; Penny Karanasou; Peter Makarov; Simon Slangens; Sri Karlapati; Thomas Drugman

arXiv:2106.15649·eess.AS·July 1, 2021

TL;DR

This paper introduces a Multi-Scale Spectrogram (MSS) modelling approach for neural text-to-speech synthesis: the system first predicts coarse-scale mel-spectrograms and then conditions finer-scale predictions on them, which improves both coarse and fine-grained prosody.

Contribution

The paper presents a novel multi-scale spectrogram prediction mechanism with two versions, Word-level MSS and Sentence-level MSS, leveraging linguistic units for enhanced prosody modeling.

Findings

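01 Word-level MSS performs statistically significantly better than the baseline in subjective evaluations on two voices.

02 The multi-scale approach captures both coarse and fine-grained prosody.

03 Linguistically motivated scales (word, phoneme, frame, and optionally sentence) improve speech synthesis quality.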

Abstract

We propose a novel Multi-Scale Spectrogram (MSS) modelling approach to synthesise speech with an improved coarse and fine-grained prosody. We present a generic multi-scale spectrogram prediction mechanism where the system first predicts coarser scale mel-spectrograms that capture the suprasegmental information in speech, and later uses these coarser scale mel-spectrograms to predict finer scale mel-spectrograms capturing fine-grained prosody. We present details for two specific versions of MSS called Word-level MSS and Sentence-level MSS where the scales in our system are motivated by the linguistic units. The Word-level MSS models word, phoneme, and frame-level spectrograms while Sentence-level MSS models sentence-level spectrogram in addition. Subjective evaluations show that Word-level MSS performs statistically significantly better compared to the baseline on two voices.
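To make the notion of "scales" concrete, the sketch below builds coarser-scale spectrograms from a frame-level mel-spectrogram by averaging the frames that belong to each linguistic unit. The abstract does not say how the coarser-scale targets are computed, so the averaging step is an assumption, and the function name `pool_by_units`, the toy array sizes, and the alignment vectors are all hypothetical. It covers only the construction of multi-scale targets, not the prediction network.

```python
import numpy as np

def pool_by_units(mel, unit_ids):
    """Average frame-level mel frames within each linguistic unit.

    mel:      array of shape (T, n_mels), one row per frame.
    unit_ids: array of shape (T,), the unit index of each frame
              (contiguous integers starting at 0; assumed to come
              from an external alignment).
    Returns an array of shape (n_units, n_mels).
    """
    mel = np.asarray(mel, dtype=float)
    unit_ids = np.asarray(unit_ids)
    n_units = int(unit_ids.max()) + 1
    sums = np.zeros((n_units, mel.shape[1]))
    np.add.at(sums, unit_ids, mel)          # accumulate frames per unit
    counts = np.bincount(unit_ids, minlength=n_units)[:, None]
    return sums / counts                    # mean frame per unit

# Toy example: 6 frames, 2 mel bins (real systems use e.g. 80 bins).
frame_mel = np.arange(12, dtype=float).reshape(6, 2)

# Hypothetical alignments: which phoneme / word each frame belongs to.
phoneme_ids = np.array([0, 0, 1, 1, 1, 2])
word_ids = np.array([0, 0, 0, 0, 0, 1])
sentence_ids = np.zeros(6, dtype=int)       # single sentence

phoneme_mel = pool_by_units(frame_mel, phoneme_ids)    # shape (3, 2)
word_mel = pool_by_units(frame_mel, word_ids)          # shape (2, 2)
sentence_mel = pool_by_units(frame_mel, sentence_ids)  # shape (1, 2)
```

Under this reading, Word-level MSS would work with `word_mel`, `phoneme_mel`, and `frame_mel`, ordered from coarse to fine, while Sentence-level MSS would additionally start from `sentence_mel`.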
