Zero-Shot Long-Form Voice Cloning with Dynamic Convolution Attention
Artem Gorodetskii, Ivan Ozhiganov

TL;DR
This paper introduces a zero-shot long-form voice cloning system that uses dynamic convolution attention to improve alignment and synthesis of very long speech utterances, maintaining naturalness and speaker similarity.
Contribution
It combines Dynamic Convolution Attention, an existing energy-based attention mechanism, with a modified Tacotron 2 synthesizer in a modular system for zero-shot voice cloning that can synthesize long utterances with high quality.
Findings
Effective long-utterance synthesis with high intelligibility
Maintains naturalness and speaker similarity for short texts
Outperforms previous models in alignment and long-form speech quality
Abstract
With recent advancements in voice cloning, the performance of speech synthesis for a target speaker has been rendered similar to the human level. However, autoregressive voice cloning systems still suffer from text alignment failures, resulting in an inability to synthesize long sentences. In this work, we propose a variant of attention-based text-to-speech system that can reproduce a target voice from a few seconds of reference speech and generalize to very long utterances as well. The proposed system is based on three independently trained components: a speaker encoder, synthesizer and universal vocoder. Generalization to long utterances is realized using an energy-based attention mechanism known as Dynamic Convolution Attention, in combination with a set of modifications proposed for the synthesizer based on Tacotron 2. Moreover, effective zero-shot speaker adaptation is achieved by…
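The abstract names Dynamic Convolution Attention without spelling out the computation. As a rough illustration, the sketch below follows the commonly described form of that mechanism: static filters, dynamic filters predicted from the decoder state, and a causal beta-binomial prior, all convolved with the previous alignment. It is not the authors' implementation. The function names, layer sizes, random weights, and prior parameters are assumptions chosen for the example.

```python
import math
import numpy as np


def beta_binomial_prior(length, a, b):
    """Beta-binomial pmf over forward shifts 0..length-1, used as a causal prior filter."""
    n = length - 1

    def log_beta(x, y):
        return math.lgamma(x) + math.lgamma(y) - math.lgamma(x + y)

    pmf = []
    for k in range(length):
        log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        pmf.append(math.exp(log_choose + log_beta(k + a, n - k + b) - log_beta(a, b)))
    return np.array(pmf)


def dca_step(prev_align, query, params, prior, eps=1e-6):
    """One decoder step: map the previous alignment and decoder state to a new alignment."""
    n_steps = prev_align.shape[0]
    n_dyn, width = params["dyn_shape"]

    # Dynamic filters are predicted from the decoder state by a small MLP.
    hidden = np.tanh(params["W_g"] @ query)
    dyn_filters = (params["V_g"] @ hidden).reshape(n_dyn, width)

    # Static and dynamic location features: filters convolved with the previous alignment.
    f = np.stack([np.convolve(prev_align, w, mode="same") for w in params["static"]])
    g = np.stack([np.convolve(prev_align, w, mode="same") for w in dyn_filters])

    # Causal prior: only the current position and a few positions ahead stay likely.
    shifted = np.convolve(prev_align, prior, mode="full")[:n_steps]
    log_prior = np.log(np.maximum(shifted, eps))

    # In this sketch the energy uses location features only, with no content-based term.
    energy = params["v"] @ np.tanh(params["U_f"] @ f + params["T_g"] @ g) + log_prior
    energy -= energy.max()  # subtract the max for a numerically stable softmax
    weights = np.exp(energy)
    return weights / weights.sum()


# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
n_steps, query_dim, hid, n_static, n_dyn, width, attn_dim = 20, 16, 8, 4, 4, 5, 8
params = {
    "dyn_shape": (n_dyn, width),
    "static": 0.1 * rng.standard_normal((n_static, width)),
    "W_g": 0.1 * rng.standard_normal((hid, query_dim)),
    "V_g": 0.1 * rng.standard_normal((n_dyn * width, hid)),
    "U_f": 0.1 * rng.standard_normal((attn_dim, n_static)),
    "T_g": 0.1 * rng.standard_normal((attn_dim, n_dyn)),
    "v": 0.1 * rng.standard_normal(attn_dim),
}
prior = beta_binomial_prior(length=11, a=0.1, b=0.9)

prev_align = np.zeros(n_steps)
prev_align[3] = 1.0  # attention currently sits on input position 3
align = dca_step(prev_align, rng.standard_normal(query_dim), params, prior)
print(align.round(3))
```

Because the log-prior heavily penalizes positions behind the previous alignment, the attention in this sketch can only stay in place or move forward by a bounded number of positions, which is the property usually credited for stable alignment on long inputs.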
Taxonomy
Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques
Methods: Tanh Activation · Long Short-Term Memory · Dilated Causal Convolution · Sigmoid Activation · Bidirectional LSTM · Batch Normalization · Location Sensitive Attention · Highway Layer
