The DeepZen Speech Synthesis System for Blizzard Challenge 2023
Christophe Veaux, Ranniery Maia, Spyridoula Papandreou

TL;DR
The DeepZen system for Blizzard Challenge 2023 employs an auto-regressive TTS model with advanced prosodic control and style modeling, achieving high-quality French speech synthesis from large and small datasets.
Contribution
Introduces a unified auto-regressive TTS architecture with style and pronunciation modeling using BERT, improving naturalness and control in speech synthesis.
Findings
Achieved second place in both tasks with median scores of 0.75 and 0.68.
Demonstrated effective style transfer and pronunciation prediction.
Performed well with both large and small datasets.
Abstract
This paper describes the DeepZen text-to-speech (TTS) system for Blizzard Challenge 2023. The goal of this challenge is to synthesise natural and high-quality speech in French, from a large monospeaker dataset (hub task) and from a smaller dataset by speaker adaptation (spoke task). We participated in both tasks with the same model architecture. Our approach is to use an auto-regressive model, which retains an advantage for generating natural-sounding speech, while improving prosodic control in several ways. Similarly to non-attentive Tacotron, the model uses a duration predictor and Gaussian upsampling at inference, but with a simpler unsupervised training. We also model the speaking style at both sentence and word levels by extracting global and local style tokens from the reference speech. At inference, the global and local style tokens are predicted from a BERT model run on…
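The Gaussian upsampling step mentioned in the abstract can be illustrated with a minimal NumPy sketch. It follows the general formulation used in non-attentive Tacotron, not the DeepZen implementation, which this page does not describe. The function name `gaussian_upsample`, the 1-based frame positions, and the toy durations and standard deviations are all assumptions made for illustration; in a real system the durations and standard deviations would come from learned predictors.

```python
import numpy as np


def gaussian_upsample(h, durations, sigmas):
    """Expand token-level vectors to frame level with Gaussian weights.

    h:         (n_tokens, dim) token encodings
    durations: (n_tokens,) duration of each token, in frames
    sigmas:    (n_tokens,) standard deviation of each token's Gaussian
    Returns (frames, weights) with shapes (n_frames, dim), (n_frames, n_tokens).
    """
    h = np.asarray(h, dtype=float)
    d = np.asarray(durations, dtype=float)
    s = np.asarray(sigmas, dtype=float)

    ends = np.cumsum(d)            # end position of each token
    centers = ends - d / 2.0       # Gaussian mean sits at the token's centre
    n_frames = int(round(ends[-1]))
    t = np.arange(1, n_frames + 1, dtype=float)  # assumed 1-based frame positions

    # Log-density of every frame position under every token's Gaussian.
    z = (t[:, None] - centers[None, :]) / s[None, :]
    log_w = -0.5 * z ** 2 - np.log(s[None, :])

    # Normalise over tokens (softmax in log space for numerical stability).
    log_w -= log_w.max(axis=1, keepdims=True)
    w = np.exp(log_w)
    w /= w.sum(axis=1, keepdims=True)

    return w @ h, w


# Toy example: three tokens with one-hot encodings, so frames equal the weights.
frames, weights = gaussian_upsample(np.eye(3), durations=[2, 3, 1], sigmas=[1.0, 1.0, 1.0])
```

Each output frame is a convex combination of the token encodings, so the transition between neighbouring tokens is smooth rather than a hard repeat of each encoding for its duration.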
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling
Methods: Attention Is All You Need · Convolution · Sigmoid Activation · Highway Layer · Tanh Activation · Softmax · Dense Connections · Batch Normalization · Linear Warmup With Linear Decay
