Fast, High-Quality and Parameter-Efficient Articulatory Synthesis using Differentiable DSP

Yisi Liu; Bohan Yu; Drake Lin; Peter Wu; Cheol Jun Cho; Gopala Krishna Anumanchipalli

arXiv:2409.02451 · eess.AS · September 5, 2024

TL;DR

This paper introduces a fast, high-quality articulatory speech synthesizer using differentiable digital signal processing that efficiently generates speech from low-dimensional EMA features, outperforming existing models in speed and parameter efficiency.

Contribution

The paper presents a novel DDSP-based articulatory vocoder that synthesizes speech from EMA, F0, and loudness, and that is faster, higher in quality, and smaller in parameter count than state-of-the-art (SOTA) methods.

Findings

01. Achieves a word error rate (WER) of 6.67% and a mean opinion score (MOS) of 3.74, outperforming the SOTA.

02. Runs inference 4.9x faster on CPU than the baseline.

03. Uses only 0.4M parameters, compared with 9M for the SOTA model.
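
For readers unfamiliar with the WER figure quoted above, the following is a minimal sketch of the standard definition: the word-level edit distance between a reference transcript and a recognized transcript, divided by the number of reference words. The function name `word_error_rate` and the example sentences are illustrative assumptions; the summary does not say which speech recognizer or text normalization the authors used.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Hypothetical helper for illustration; not the authors' evaluation code.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One deleted word out of six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
```

A lower value is better; a WER of 6.67% corresponds to roughly one word error per fifteen reference words.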

Abstract

Articulatory trajectories like electromagnetic articulography (EMA) provide a low-dimensional representation of the vocal tract filter and have been used as natural, grounded features for speech synthesis. Differentiable digital signal processing (DDSP) is a parameter-efficient framework for audio synthesis. Therefore, integrating low-dimensional EMA features with DDSP can significantly enhance the computational efficiency of speech synthesis. In this paper, we propose a fast, high-quality, and parameter-efficient DDSP articulatory vocoder that can synthesize speech from EMA, F0, and loudness. We incorporate several techniques to solve the harmonics / noise imbalance problem, and add a multi-resolution adversarial loss for better synthesis quality. Our model achieves a transcription word error rate (WER) of 6.67% and a mean opinion score (MOS) of 3.74, with an improvement of 1.63% and…
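
To make the "harmonics / noise" terminology concrete, here is a minimal NumPy sketch of a generic harmonic-plus-noise synthesizer of the kind DDSP models are typically built around. It is not the authors' architecture, which this summary does not describe. The function name, the 16 kHz sample rate, the hop size, and the use of plain gain-scaled white noise (rather than filtered noise) are all assumptions made for illustration. In a full vocoder, a learned encoder would predict `harmonic_amps` and `noise_gain` from EMA, F0, and loudness; here they are passed in directly.

```python
import numpy as np


def harmonic_plus_noise(f0_hz, harmonic_amps, noise_gain,
                        sample_rate=16000, hop=80, seed=0):
    """Synthesize audio from frame-rate controls (illustrative sketch).

    f0_hz:         (T,)   fundamental frequency per frame, in Hz
    harmonic_amps: (T, K) amplitude of each of K harmonics per frame
    noise_gain:    (T,)   gain of the noise component per frame
    """
    f0_hz = np.asarray(f0_hz, dtype=float)
    harmonic_amps = np.asarray(harmonic_amps, dtype=float)
    noise_gain = np.asarray(noise_gain, dtype=float)
    n_frames, n_harmonics = harmonic_amps.shape
    n_samples = n_frames * hop

    # Upsample frame-rate controls to the audio rate by linear interpolation.
    frame_pos = np.arange(n_frames) * hop
    sample_pos = np.arange(n_samples)
    f0 = np.interp(sample_pos, frame_pos, f0_hz)
    gain = np.interp(sample_pos, frame_pos, noise_gain)
    amps = np.stack(
        [np.interp(sample_pos, frame_pos, harmonic_amps[:, k])
         for k in range(n_harmonics)],
        axis=1,
    )

    # Integrate instantaneous frequency to get the fundamental's phase.
    phase = 2.0 * np.pi * np.cumsum(f0) / sample_rate
    harmonic_index = np.arange(1, n_harmonics + 1)

    # Silence harmonics at or above the Nyquist frequency to avoid aliasing.
    harmonic_freqs = f0[:, None] * harmonic_index[None, :]
    amps = np.where(harmonic_freqs < sample_rate / 2.0, amps, 0.0)

    harmonic = np.sum(amps * np.sin(phase[:, None] * harmonic_index[None, :]),
                      axis=1)

    # Noise branch: white noise scaled by a time-varying gain (a simplification).
    rng = np.random.default_rng(seed)
    noise = gain * rng.uniform(-1.0, 1.0, n_samples)

    return harmonic + noise


# Half a second of a steady 120 Hz tone with four harmonics and light noise.
T, K = 100, 4
audio = harmonic_plus_noise(np.full(T, 120.0), np.full((T, K), 0.1),
                            np.full(T, 0.01))
```

Because every step is a differentiable array operation, the same structure can be trained end to end when written in an autodiff framework. The relative scale of the two branches is the kind of balance the abstract refers to as the harmonics / noise imbalance problem.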

Taxonomy

Topics: Hand Gesture Recognition Systems · Speech and Audio Processing

Methods: Differentiable Digital Signal Processing