Speech Synthesis and Control Using Differentiable DSP
Giorgio Fabbro, Vladimir Golkov, Thomas Kemp, Daniel Cremers

TL;DR
This paper introduces a neural vocoder leveraging differentiable digital signal processing to enable explicit control over speech variation factors, resulting in natural, controllable speech synthesis.
Contribution
It extends DDSP techniques from music to speech, allowing explicit manipulation of speech factors in neural synthesis.
Findings
Produces natural speech with realistic timbre
Allows independent control of pitch, rhythm, loudness, and timbre
Demonstrates effective factor manipulation in speech synthesis
Abstract
Modern text-to-speech systems are able to produce natural and high-quality speech, but speech contains factors of variation (e.g. pitch, rhythm, loudness, timbre) that text alone cannot contain. In this work we move towards a speech synthesis system that can produce diverse speech renditions of a text by allowing (but not requiring) explicit control over the various factors of variation. We propose a new neural vocoder that offers control of such factors of variation. This is achieved by employing differentiable digital signal processing (DDSP) (previously used only for music rather than speech), which exposes these factors of variation. The results show that the proposed approach can produce natural speech with realistic timbre, and individual factors of variation can be freely controlled.
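To make "exposes these factors of variation" more concrete, here is a minimal sketch of a generic additive harmonic oscillator, the kind of signal-processing component that DDSP-style models are typically built around. It is not the paper's vocoder: the function name `harmonic_synth`, its parameters, and the fixed `harmonic_weights` (a crude stand-in for timbre) are all assumptions made for illustration, and the controls are set by hand here, whereas a DDSP model would have a neural network predict them. The point is only that pitch and loudness enter as separate, directly editable inputs.

```python
import math


def harmonic_synth(f0_hz, amplitude, harmonic_weights, sample_rate=16000):
    """Additive harmonic synthesizer with per-sample pitch and loudness controls.

    Hypothetical helper for illustration only.
    f0_hz, amplitude: equal-length lists, one value per output sample.
    harmonic_weights: positive relative weights of harmonics 1..K.
    """
    if len(f0_hz) != len(amplitude):
        raise ValueError("f0_hz and amplitude must have the same length")
    total = sum(harmonic_weights)
    # Normalise so the output magnitude never exceeds the amplitude control.
    weights = [w / total for w in harmonic_weights]
    nyquist = sample_rate / 2.0
    phase = 0.0
    out = []
    for f0, amp in zip(f0_hz, amplitude):
        # Accumulate the instantaneous frequency to get the fundamental's phase.
        phase += 2.0 * math.pi * f0 / sample_rate
        sample = 0.0
        for k, w in enumerate(weights, start=1):
            if k * f0 < nyquist:  # skip harmonics that would alias
                sample += w * math.sin(k * phase)
        out.append(amp * sample)
    return out


# Same loudness and harmonic weights, two different pitch contours.
n = 1600  # 0.1 s at 16 kHz
low = harmonic_synth([120.0] * n, [0.5] * n, [1.0, 0.5, 0.25])
high = harmonic_synth([180.0] * n, [0.5] * n, [1.0, 0.5, 0.25])
```

Because the pitch contour is an explicit input, changing `f0_hz` from 120 Hz to 180 Hz raises the pitch while the loudness control and harmonic weights stay untouched. This is the sense in which such a synthesizer keeps factors of variation separately controllable.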
Taxonomy
Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing
