Boosting Diffusion Model for Spectrogram Up-sampling in Text-to-speech: An Empirical Study

Chong Zhang; Yanqing Liu; Yang Zheng; Sheng Zhao

arXiv:2406.04633·eess.AS·June 10, 2024

Open Access

TL;DR

This paper empirically investigates diffusion models for spectrogram up-sampling in text-to-speech systems, aiming to improve quality and efficiency for real-time applications.

Contribution

It presents a systematic study of diffusion model architectures and objectives specifically for spectrogram up-sampling in TTS, addressing the high latency of LM-based TTS pipelines that use a diffusion up-sampler.

Findings

01. Improved speech quality with diffusion-based up-sampling.

02. Enhanced efficiency suitable for streaming synthesis.

03. Better objective and subjective metrics compared to baseline methods.

Abstract

Scaling text-to-speech (TTS) with autoregressive language models (LMs) to large-scale datasets by quantizing waveforms into discrete speech tokens is making great progress in capturing the diversity and expressiveness of human speech, but the quality of speech reconstructed from discrete speech tokens is far from satisfactory and depends on the token compression ratio. Generative diffusion models trained with a score-matching loss and continuous normalizing flows trained with a flow-matching loss have become prominent in the generation of images as well as speech. LM-based TTS systems usually quantize speech into discrete tokens, generate these tokens autoregressively, and finally use a diffusion model to up-sample the coarse-grained speech tokens into fine-grained codec features or mel-spectrograms before reconstructing waveforms with a vocoder, which has a high latency and is not…
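
The flow-matching loss mentioned in the abstract can be illustrated with a minimal sketch. It assumes the commonly used straight-line path between a Gaussian noise sample and a data sample, which is not necessarily the formulation used in the paper. The names `linear_path`, `target_velocity`, `flow_matching_loss`, and `zero_model` are hypothetical, the "mel frame" is toy data, and the conditioning on speech tokens is omitted for brevity.

```python
import random

def linear_path(x0, x1, t):
    """Point on the straight line from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Time derivative of the linear path; it does not depend on t."""
    return [b - a for a, b in zip(x0, x1)]

def flow_matching_loss(model, x1, rng):
    """One-sample Monte Carlo estimate of the loss for a data vector x1.

    `model(xt, t)` is any callable that predicts a velocity vector
    (an assumption for illustration, standing in for a neural network).
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]   # noise sample
    t = rng.random()                         # time drawn uniformly from [0, 1)
    xt = linear_path(x0, x1, t)              # noisy interpolant
    pred = model(xt, t)                      # predicted velocity
    target = target_velocity(x0, x1)         # regression target
    return sum((p - v) ** 2 for p, v in zip(pred, target)) / len(x1)

def zero_model(xt, t):
    """Placeholder model that always predicts zero velocity."""
    return [0.0 for _ in xt]

rng = random.Random(0)
mel_frame = [0.5, -1.0, 2.0]  # hypothetical toy "mel-spectrogram frame"
loss = flow_matching_loss(zero_model, mel_frame, rng)
```

In a real system the placeholder model would be a trained network, and the loss would be averaged over many frames, noise samples, and time steps.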

Taxonomy

Topics: Speech and Audio Processing

Methods: Diffusion