Fast, High-Quality and Parameter-Efficient Articulatory Synthesis using Differentiable DSP
Yisi Liu, Bohan Yu, Drake Lin, Peter Wu, Cheol Jun Cho, Gopala Krishna Anumanchipalli

TL;DR
This paper introduces a fast, high-quality articulatory speech synthesizer using differentiable digital signal processing that efficiently generates speech from low-dimensional EMA features, outperforming existing models in speed and parameter efficiency.
Contribution
It presents a novel DDSP-based articulatory vocoder that synthesizes speech from EMA, F0, and loudness with improved speed, quality, and fewer parameters compared to state-of-the-art methods.
Findings
Achieves 6.67% WER and 3.74 MOS, outperforming SOTA.
4.9x faster inference on CPU than baseline.
Uses only 0.4M parameters, compared with 9M for the SOTA model.
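To put the parameter counts in proportion, the short calculation below derives the size ratio from the two rounded figures quoted in the findings (0.4M and 9M). The variable names are illustrative, and because the inputs are rounded the result is approximate.

```python
# Rounded headline figures quoted in the findings above.
proposed_params = 400_000      # 0.4M
baseline_params = 9_000_000    # 9M

# How many times smaller the proposed model is, and the relative reduction.
size_ratio = baseline_params / proposed_params
reduction = 1 - proposed_params / baseline_params

print(f"{size_ratio:.1f}x fewer parameters")  # 22.5x fewer parameters
print(f"{reduction:.1%} reduction")           # 95.6% reduction
```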
Abstract
Articulatory trajectories like electromagnetic articulography (EMA) provide a low-dimensional representation of the vocal tract filter and have been used as natural, grounded features for speech synthesis. Differentiable digital signal processing (DDSP) is a parameter-efficient framework for audio synthesis. Therefore, integrating low-dimensional EMA features with DDSP can significantly enhance the computational efficiency of speech synthesis. In this paper, we propose a fast, high-quality, and parameter-efficient DDSP articulatory vocoder that can synthesize speech from EMA, F0, and loudness. We incorporate several techniques to solve the harmonics / noise imbalance problem, and add a multi-resolution adversarial loss for better synthesis quality. Our model achieves a transcription word error rate (WER) of 6.67% and a mean opinion score (MOS) of 3.74, with an improvement of 1.63% and…
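For readers unfamiliar with DDSP, the sketch below shows the generic harmonic-plus-noise idea that such vocoders build on: a bank of sinusoids at integer multiples of F0 plus a noise component. It is a minimal illustration under stated assumptions, not the authors' model. The function name `harmonic_plus_noise`, its parameters, and the hand-picked amplitudes are hypothetical. In a DDSP vocoder the per-harmonic amplitudes and noise controls would be predicted by a neural network from the input features, and the noise would typically be spectrally filtered rather than simply scaled as it is here.

```python
import numpy as np

def harmonic_plus_noise(f0_hz, harmonic_amps, noise_gain, sample_rate=16000, seed=0):
    """Generic harmonic-plus-noise synthesis from per-sample controls.

    f0_hz:         (T,) fundamental frequency in Hz (0 marks unvoiced samples)
    harmonic_amps: (T, K) amplitude of each of the K harmonics
    noise_gain:    (T,) gain applied to white noise (a simplification)
    """
    f0_hz = np.asarray(f0_hz, dtype=np.float64)
    harmonic_amps = np.asarray(harmonic_amps, dtype=np.float64)
    noise_gain = np.asarray(noise_gain, dtype=np.float64)
    k = np.arange(1, harmonic_amps.shape[1] + 1)  # harmonic numbers 1..K

    # Instantaneous phase is the running integral of frequency.
    phase = 2.0 * np.pi * np.cumsum(f0_hz / sample_rate)

    # Mute harmonics at or above Nyquist (to avoid aliasing) and unvoiced samples.
    harmonic_freqs = f0_hz[:, None] * k[None, :]
    audible = (harmonic_freqs < sample_rate / 2) & (f0_hz[:, None] > 0)

    harmonic = np.sum(harmonic_amps * audible * np.sin(phase[:, None] * k[None, :]), axis=1)
    noise = noise_gain * np.random.default_rng(seed).standard_normal(f0_hz.shape[0])
    return harmonic + noise

# Usage: 0.1 s of a 120 Hz tone with 8 harmonics (1/k roll-off) and faint noise.
sr = 16000
n = sr // 10
amps = 0.1 * np.tile(1.0 / np.arange(1, 9), (n, 1))
audio = harmonic_plus_noise(np.full(n, 120.0), amps, np.full(n, 0.01), sample_rate=sr)
```

Because every operation is a differentiable array expression, the same structure can be trained end to end when written in an autodiff framework, which is what lets a small network drive the synthesizer.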
Taxonomy
Topics: Hand Gesture Recognition Systems · Speech and Audio Processing
Methods: Differentiable Digital Signal Processing
