Speech Synthesis and Control Using Differentiable DSP

Giorgio Fabbro; Vladimir Golkov; Thomas Kemp; Daniel Cremers

arXiv:2010.15084 · eess.AS · October 29, 2020 · 6 cites

TL;DR

This paper introduces a neural vocoder leveraging differentiable digital signal processing to enable explicit control over speech variation factors, resulting in natural, controllable speech synthesis.

Contribution

The paper extends differentiable digital signal processing (DDSP) techniques from music to speech, allowing explicit manipulation of speech factors of variation in neural synthesis.

Findings

01 Produces natural speech with realistic timbre

02 Allows independent control of pitch, rhythm, loudness, and timbre

03 Demonstrates effective factor manipulation in speech synthesis

Abstract

Modern text-to-speech systems are able to produce natural and high-quality speech, but speech contains factors of variation (e.g. pitch, rhythm, loudness, timbre) that text alone cannot contain. In this work we move towards a speech synthesis system that can produce diverse speech renditions of a text by allowing (but not requiring) explicit control over the various factors of variation. We propose a new neural vocoder that offers control of such factors of variation. This is achieved by employing differentiable digital signal processing (DDSP) (previously used only for music rather than speech), which exposes these factors of variation. The results show that the proposed approach can produce natural speech with realistic timbre, and individual factors of variation can be freely controlled.
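
To make "exposes these factors of variation" concrete, here is a minimal sketch of an additive harmonic oscillator, a component commonly used in DDSP-style synthesizers. It is not the authors' vocoder: the function name `harmonic_synth`, its parameters, and the fixed harmonic weights are all assumptions for illustration, and the code is plain Python with no automatic differentiation or learned components. It only shows how pitch (`f0_hz`), loudness, and a crude timbre proxy (`harmonic_weights`) can appear as separate inputs.

```python
import math

def harmonic_synth(f0_hz, loudness, harmonic_weights, sample_rate=16000):
    """Hypothetical additive synthesizer with one control value per output sample.

    f0_hz, loudness: equal-length lists of per-sample controls.
    harmonic_weights: relative amplitude of each harmonic (crude timbre proxy).
    """
    if len(f0_hz) != len(loudness):
        raise ValueError("f0_hz and loudness must have the same length")
    total = sum(harmonic_weights)
    # Normalize so that loudness alone sets the overall level.
    weights = [w / total for w in harmonic_weights]
    nyquist = sample_rate / 2.0
    phase = 0.0
    out = []
    for f0, amp in zip(f0_hz, loudness):
        # Integrate the instantaneous frequency to get the phase.
        phase += 2.0 * math.pi * f0 / sample_rate
        sample = 0.0
        for k, w in enumerate(weights, start=1):
            if k * f0 < nyquist:  # skip harmonics that would alias
                sample += w * math.sin(k * phase)
        out.append(amp * sample)
    return out

# Change one control at a time while holding the others fixed.
n = 1600  # 0.1 s at 16 kHz
timbre = [1.0, 0.5, 0.25]
base = harmonic_synth([120.0] * n, [0.5] * n, timbre)
higher = harmonic_synth([180.0] * n, [0.5] * n, timbre)   # pitch changed only
quieter = harmonic_synth([120.0] * n, [0.25] * n, timbre)  # loudness changed only
```

In this sketch, halving the loudness control scales every sample by one half without touching pitch or harmonic balance, which is the kind of independent control the abstract describes. A trainable version would need the same operations written in an autodiff framework so that a network predicting the controls can be trained through the synthesizer.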

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing