Diffusion-Based Mel-Spectrogram Enhancement for Personalized Speech Synthesis with Found Data

Yusheng Tian; Wei Liu; Tan Lee

arXiv:2305.10891 · eess.AS · October 3, 2023 · ASRU · 1 cites

Open Access · 1 Repo

TL;DR

This paper proposes a diffusion-based speech enhancement method applied to log Mel-spectrograms, improving the quality of synthetic voices trained on found data with various degradations.

Contribution

It introduces a conditional diffusion model for generalized speech enhancement in the Mel-spectrogram domain, incorporating text information for robustness.

Findings

01 Enhanced synthetic speech quality on real-world recordings

02 Outperforms baseline enhancement methods

03 Available code and pre-trained models for reproducibility

Abstract

Creating synthetic voices with found data is challenging, as real-world recordings often contain various types of audio degradation. One way to address this problem is to pre-enhance the speech with an enhancement model and then use the enhanced data for text-to-speech (TTS) model training. This paper investigates the use of conditional diffusion models for generalized speech enhancement, which aims to address multiple types of audio degradation simultaneously. The enhancement is performed in the log Mel-spectrogram domain to align with the TTS training objective. Text information is introduced as an additional condition to improve model robustness. Experiments on real-world recordings demonstrate that a synthetic voice built on data enhanced by the proposed model produces higher-quality synthetic speech than voices trained on data enhanced by strong baselines. Code and…
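To make the idea of conditional diffusion in the log Mel-spectrogram domain concrete, the sketch below shows one forward noising step and how a denoiser input could be assembled from the noised clean features and the degraded recording. It is an illustration only: the abstract does not state the diffusion formulation, so a generic DDPM-style variance-preserving step is used as a stand-in, the arrays are random placeholders rather than real log Mel-spectrograms, the text condition is omitted, and all function and variable names are hypothetical.

```python
import numpy as np

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Per-step noise variances (a common default, assumed here, not taken from the paper)."""
    return np.linspace(beta_start, beta_end, num_steps)

def forward_noise(clean_logmel, t, alpha_bar, rng):
    """Sample x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps."""
    noise = rng.standard_normal(clean_logmel.shape)
    noisy = np.sqrt(alpha_bar[t]) * clean_logmel + np.sqrt(1.0 - alpha_bar[t]) * noise
    return noisy, noise

rng = np.random.default_rng(0)
num_steps = 1000
alpha_bar = np.cumprod(1.0 - linear_beta_schedule(num_steps))

# Placeholder features: 80 Mel bins x 120 frames (shapes are assumptions).
clean_logmel = rng.standard_normal((80, 120))
# Stand-in for a degraded recording of the same utterance.
degraded_logmel = clean_logmel + 0.3 * rng.standard_normal((80, 120))

t = 500
noisy, noise = forward_noise(clean_logmel, t, alpha_bar, rng)

# One simple way to condition: stack the degraded features with the noised
# clean features along the channel axis before feeding a denoising network.
denoiser_input = np.concatenate([noisy, degraded_logmel], axis=0)
```

In a training loop of this kind, a network would receive `denoiser_input` and the step index `t` and be trained to predict `noise`; the text condition described in the abstract would enter as an additional input.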

Code & Models

Repositories

dmse4tts/dmse4tts
PyTorch · Official

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Diffusion · ALIGN