An Investigation of Noise Robustness for Flow-Matching-Based Zero-Shot TTS

Xiaofei Wang; Sefik Emre Eskimez; Manthan Thakker; Hemin Yang; Zirun Zhu; Min Tang; Yufei Xia; Jinzhu Li; Sheng Zhao; Jinyu Li; Naoyuki Kanda

arXiv:2406.05699·eess.AS·June 11, 2024

Open Access

TL;DR

This paper investigates how to make flow-matching-based zero-shot TTS systems robust to noisy audio prompts, and introduces training strategies that improve intelligibility, speaker similarity, and overall audio quality.

Contribution

The study proposes training strategies to improve noise robustness in zero-shot TTS: unsupervised pre-training with masked speech denoising, multi-speaker detection and DNSMOS-based filtering of the pre-training data, and fine-tuning with random noise mixing.
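
The data-filtering step can be pictured with a minimal sketch. It assumes each pre-training utterance already carries a DNSMOS-style quality score and a detected speaker count, both computed elsewhere. The `Utterance` record, the `filter_pretraining_data` helper, and the `min_score` threshold are hypothetical names and values chosen for illustration; the paper's actual criteria are not given on this page.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    path: str
    quality_score: float  # assumed: a DNSMOS-style predicted quality score
    num_speakers: int     # assumed: output of a multi-speaker detector


def filter_pretraining_data(utterances, min_score=3.0):
    """Keep single-speaker utterances whose quality score meets the threshold.

    min_score is an illustrative value, not one reported by the authors.
    """
    return [
        u for u in utterances
        if u.num_speakers == 1 and u.quality_score >= min_score
    ]


if __name__ == "__main__":
    # Toy records with made-up values, for demonstration only.
    corpus = [
        Utterance("a.wav", 3.6, 1),
        Utterance("b.wav", 2.1, 1),  # dropped: low quality score
        Utterance("c.wav", 3.9, 2),  # dropped: more than one speaker
    ]
    kept = filter_pretraining_data(corpus)
    print([u.path for u in kept])
```

Treating the two checks as a joint condition is an assumption of this sketch; they could equally be applied as separate passes over the data.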

Findings

01

Enhanced intelligibility and speaker similarity in noisy conditions

02

Significant quality improvements over simple speech enhancement methods

03

Effective training strategies for noise-robust zero-shot TTS

Abstract

Recently, zero-shot text-to-speech (TTS) systems, capable of synthesizing any speaker's voice from a short audio prompt, have made rapid advancements. However, the quality of the generated speech significantly deteriorates when the audio prompt contains noise, and limited research has been conducted to address this issue. In this paper, we explored various strategies to enhance the quality of audio generated from noisy audio prompts within the context of flow-matching-based zero-shot TTS. Our investigation includes comprehensive training strategies: unsupervised pre-training with masked speech denoising, multi-speaker detection and DNSMOS-based data filtering on the pre-training data, and fine-tuning with random noise mixing. The results of our experiments demonstrate significant improvements in intelligibility, speaker similarity, and overall audio quality compared to the approach of…
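
As an illustration of fine-tuning with random noise mixing, the sketch below adds a noise clip to clean speech at a randomly drawn signal-to-noise ratio. It is a generic augmentation written under stated assumptions, not the authors' implementation: the function names, the SNR range, and the mixing probability `p_mix` are hypothetical choices.

```python
import numpy as np


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech so the speech-to-noise power ratio equals snr_db."""
    # Repeat the noise clip if it is shorter than the speech, then crop.
    if len(noise) < len(speech):
        reps = int(np.ceil(len(speech) / len(noise)))
        noise = np.tile(noise, reps)
    noise = noise[: len(speech)]

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    if speech_power == 0 or noise_power == 0:
        return speech.copy()  # nothing meaningful to scale

    # Solve speech_power / (scale^2 * noise_power) = 10^(snr_db / 10) for scale.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise


def random_noise_mix(speech, noise, rng, snr_range=(0.0, 20.0), p_mix=0.5):
    """With probability p_mix, mix noise in at an SNR drawn uniformly from snr_range.

    snr_range and p_mix are illustrative defaults, not values from the paper.
    """
    if rng.random() >= p_mix:
        return speech.copy()
    snr_db = rng.uniform(*snr_range)
    return mix_at_snr(speech, noise, snr_db)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    speech = rng.standard_normal(16000)  # stand-in for a clean waveform
    noise = rng.standard_normal(4000)    # stand-in for a noise clip
    noisy = random_noise_mix(speech, noise, rng)
    print(noisy.shape)
```

Leaving some examples clean (here via `p_mix`) is one common way to keep clean-prompt performance from degrading; whether the paper does this is not stated in the abstract.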

Taxonomy

Topics: Advanced Adaptive Filtering Techniques · Speech and Audio Processing · Blind Source Separation Techniques