Coding Speech through Vocal Tract Kinematics
Cheol Jun Cho, Peter Wu, Tejas S. Prabhune, Dhruv Agarwal, Gopala K. Anumanchipalli

TL;DR
This paper introduces SPARC, a neural framework that encodes and decodes speech through interpretable vocal tract kinematic features, enabling high-quality synthesis and zero-shot voice conversion.
Contribution
The paper presents an articulatory coding framework that infers kinematic vocal tract features from speech audio and synthesizes speech audio back from those features, achieving high intelligibility and speaker generalization.
Findings
Achieves fully intelligible, high-quality speech synthesis from articulatory features.
Enables zero-shot voice conversion that changes the speaker's voice while preserving the source speaker's accent.
Generalizes well to unseen speakers and accents.
Abstract
Vocal tract articulation is a natural, grounded control space of speech production. The spatiotemporal coordination of articulators combined with the vocal source shapes intelligible speech sounds to enable effective spoken communication. Based on this physiological grounding of speech, we propose a new framework of neural encoding-decoding of speech -- Speech Articulatory Coding (SPARC). SPARC comprises an articulatory analysis model that infers articulatory features from speech audio, and an articulatory synthesis model that synthesizes speech audio from articulatory features. The articulatory features are kinematic traces of vocal tract articulators and source features, which are intuitively interpretable and controllable, being the actual physical interface of speech production. An additional speaker identity encoder is jointly trained with the articulatory synthesizer to inform the…
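The data flow described in the abstract (analysis model, speaker identity encoder, synthesis model) can be sketched as a toy pipeline. This is only an illustration of the interfaces, not the authors' implementation: every name here (`ArticulatoryCode`, `analyze`, `encode_speaker`, `synthesize`, `convert`) is hypothetical, the function bodies are arithmetic stand-ins for neural models, and treating voice conversion as "source features plus target speaker embedding" is an assumption drawn from the summary above.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical placeholder names throughout; none of this is the SPARC API.

@dataclass
class ArticulatoryCode:
    kinematics: List[List[float]]  # one row per frame: articulator trace values
    source: List[List[float]]      # one row per frame: source features


def analyze(audio: List[float], frame_size: int = 4) -> ArticulatoryCode:
    """Stand-in for the articulatory analysis model: audio -> per-frame features."""
    frames = [
        audio[i:i + frame_size]
        for i in range(0, len(audio) - frame_size + 1, frame_size)
    ]
    kinematics = [[sum(f) / frame_size] for f in frames]  # toy "articulator trace"
    source = [[max(abs(x) for x in f)] for f in frames]   # toy "source feature"
    return ArticulatoryCode(kinematics, source)


def encode_speaker(audio: List[float]) -> List[float]:
    """Stand-in for the speaker identity encoder: audio -> fixed-size embedding."""
    return [sum(abs(x) for x in audio) / len(audio)]


def synthesize(code: ArticulatoryCode, speaker: List[float],
               frame_size: int = 4) -> List[float]:
    """Stand-in for the articulatory synthesis model: features + speaker -> audio."""
    out: List[float] = []
    for k, s in zip(code.kinematics, code.source):
        out.extend([k[0] * s[0] * speaker[0]] * frame_size)
    return out


def resynthesize(audio: List[float]) -> List[float]:
    """Encode and decode with the same speaker (plain analysis-synthesis)."""
    return synthesize(analyze(audio), encode_speaker(audio))


def convert(source_audio: List[float], target_audio: List[float]) -> List[float]:
    """Assumed voice-conversion recipe: source articulation, target speaker."""
    return synthesize(analyze(source_audio), encode_speaker(target_audio))


if __name__ == "__main__":
    src = [0.1] * 8   # toy "utterance" from the source speaker
    tgt = [0.5] * 8   # toy reference audio from the target speaker
    print(len(analyze(src).kinematics), "frames;",
          len(convert(src, tgt)), "output samples")
```

The point of the sketch is the separation of concerns: the articulatory code carries what is being said and how it is articulated, while the speaker embedding is a separate input to the synthesizer, so the two can be recombined across utterances.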
Taxonomy
Topics: Phonetics and Phonology Research
