Voice Filter: Few-shot text-to-speech speaker adaptation using voice conversion as a post-processing module

Adam Gabryś; Goeric Huybrechts; Manuel Sam Ribeiro; Chung-Ming Chien; Julian Roth; Giulia Comini; Roberto Barra-Chicote; Bartek Perz; Jaime Lorenzo-Trueba

arXiv:2202.08164·eess.AS·February 17, 2022

Open Access

TL;DR

This paper introduces Voice Filter, a novel low-resource TTS method that leverages voice conversion as a post-processing step, enabling high-quality speech synthesis with minimal target speaker data, outperforming existing few-shot techniques.

Contribution

The paper frames few-shot TTS as a voice conversion task, uses a duration-controllable TTS system to create the parallel training data, and reports better results than existing few-shot techniques with only one minute of target-speaker speech.
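The pipeline described above can be illustrated with a minimal sketch. Everything in it is hypothetical: `Utterance`, `toy_tts`, `build_parallel_corpus`, and `synthesize` are illustrative names, the "acoustic frames" are placeholder numbers, and the step of reusing the recording's phoneme durations to obtain frame-aligned pairs is an assumption about how duration control produces a parallel corpus, not a detail confirmed by this page.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Frame = List[float]  # toy stand-in for one acoustic feature frame


@dataclass
class Utterance:
    phonemes: List[str]
    durations: List[int]  # frames per phoneme (assumed to come from an aligner)
    frames: List[Frame]   # recorded speech features, one entry per frame


def toy_tts(phonemes: List[str], durations: List[int]) -> List[Frame]:
    """Hypothetical duration-controllable TTS: emits durations[i] frames for phoneme i."""
    frames: List[Frame] = []
    for phoneme, n_frames in zip(phonemes, durations):
        # Deterministic placeholder value instead of real acoustics.
        value = (sum(map(ord, phoneme)) % 10) / 10.0
        frames.extend([[value] for _ in range(n_frames)])
    return frames


def build_parallel_corpus(
    utterances: List[Utterance],
) -> List[Tuple[List[Frame], List[Frame]]]:
    """Pair synthetic frames with recorded frames of identical length.

    Assumption: forcing the recording's durations onto the TTS output
    makes the two sequences frame-aligned, so a voice conversion model
    can be trained on (synthetic, recorded) pairs.
    """
    pairs = []
    for utt in utterances:
        synthetic = toy_tts(utt.phonemes, utt.durations)
        if len(synthetic) != len(utt.frames):
            raise ValueError("durations do not match the recording length")
        pairs.append((synthetic, utt.frames))
    return pairs


def synthesize(
    phonemes: List[str],
    durations: List[int],
    voice_filter: Callable[[List[Frame]], List[Frame]],
) -> List[Frame]:
    """Inference: run the TTS, then apply voice conversion as a post-processing step."""
    return voice_filter(toy_tts(phonemes, durations))
```

In this sketch the voice conversion model is just a callable passed to `synthesize`; in a real system it would be a trained network, fitted on the pairs returned by `build_parallel_corpus`, that maps the TTS output toward the target speaker.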

Findings

01 Outperforms state-of-the-art few-shot TTS methods on objective metrics.

02 Achieves comparable quality to models trained on 30 times more data.

03 Effective with as little as one minute of target speaker speech.

Abstract

State-of-the-art text-to-speech (TTS) systems require several hours of recorded speech data to generate high-quality synthetic speech. When using reduced amounts of training data, standard TTS models suffer from speech quality and intelligibility degradations, making training low-resource TTS systems problematic. In this paper, we propose a novel extremely low-resource TTS method called Voice Filter that uses as little as one minute of speech from a target speaker. It uses voice conversion (VC) as a post-processing module appended to a pre-existing high-quality TTS system and marks a conceptual shift in the existing TTS paradigm, framing the few-shot TTS problem as a VC task. Furthermore, we propose to use a duration-controllable TTS system to create a parallel speech corpus to facilitate the VC task. Results show that the Voice Filter outperforms state-of-the-art few-shot speech…

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Speech and dialogue systems