Self-Supervised Models of Speech Infer Universal Articulatory Kinematics
Cheol Jun Cho, Abdelrahman Mohamed, Alan W Black, Gopala K. Anumanchipalli

TL;DR
This paper demonstrates that self-supervised speech models inherently encode articulatory dynamics, which are transferable across languages, speakers, and dialects, revealing their potential for universal, interpretable speech modeling.
Contribution
The paper shows that inferring articulatory kinematics is a fundamental property of SSL speech models, and that this ability transfers across diverse languages and speakers.
Findings
SSL models encode articulatory kinematics.
Articulatory representations largely overlap across training languages, with a preference for languages with similar phonological systems.
Affine transformations enable cross-speaker articulatory inversion.
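As a rough illustration of the last finding, the sketch below fits an affine map between two speakers' articulatory spaces by least squares. It is not the paper's procedure, which this summary does not describe. The data are synthetic, and the 12-channel layout, the frame count, and the names `fit_affine` and `apply_affine` are assumptions made for the example.

```python
import numpy as np

def fit_affine(source, target):
    """Least-squares fit of target ≈ source @ W + b."""
    # Append a column of ones so the bias is estimated with the weights.
    ones = np.ones((source.shape[0], 1))
    augmented = np.hstack([source, ones])
    solution, *_ = np.linalg.lstsq(augmented, target, rcond=None)
    return solution[:-1], solution[-1]

def apply_affine(source, W, b):
    """Map source-speaker frames into the target speaker's space."""
    return source @ W + b

rng = np.random.default_rng(0)

# Synthetic stand-in data (assumption): 500 time-aligned frames of
# 12 articulatory channels for speaker A.
speaker_a = rng.normal(size=(500, 12))

# Speaker B is generated as a noisy affine image of speaker A,
# so an affine map is the right model by construction.
true_W = rng.normal(size=(12, 12))
true_b = rng.normal(size=12)
speaker_b = speaker_a @ true_W + true_b + 0.01 * rng.normal(size=(500, 12))

W, b = fit_affine(speaker_a, speaker_b)
mapped = apply_affine(speaker_a, W, b)
max_abs_error = float(np.max(np.abs(mapped - speaker_b)))
```

On real recordings the two speakers' frames would first have to be time-aligned, and the fit should be evaluated on held-out frames rather than on the data used to estimate `W` and `b`.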
Abstract
Self-Supervised Learning (SSL) based models of speech have shown remarkable performance on a range of downstream tasks. These state-of-the-art models have remained black boxes, but many recent studies have begun "probing" models like HuBERT to correlate their internal representations with different aspects of speech. In this paper, we show "inference of articulatory kinematics" as a fundamental property of SSL models, i.e., the ability of these models to transform acoustics into the causal articulatory dynamics underlying the speech signal. We also show that this abstraction is largely overlapping across the language of the data used to train the model, with a preference for languages with a similar phonological system. Furthermore, we show that with simple affine transformations, Acoustic-to-Articulatory Inversion (AAI) is transferable across speakers, even across genders, languages, and…
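The kind of probing the abstract mentions can be sketched as a regression from frame-level model features to articulator trajectories, scored by per-channel Pearson correlation on held-out frames. This is a minimal sketch under stated assumptions, not the authors' setup: the features are random stand-ins rather than real HuBERT outputs, the probe is a closed-form ridge regression, and the dimensions (64 features, 12 articulatory channels) and function names are hypothetical.

```python
import numpy as np

def fit_linear_probe(features, targets, ridge=1e-3):
    """Closed-form ridge regression from features to targets, with bias."""
    feat_mean = features.mean(axis=0)
    targ_mean = targets.mean(axis=0)
    X = features - feat_mean
    Y = targets - targ_mean
    # Regularized normal equations: (X^T X + ridge * I) W = X^T Y
    gram = X.T @ X + ridge * np.eye(X.shape[1])
    W = np.linalg.solve(gram, X.T @ Y)
    b = targ_mean - feat_mean @ W
    return W, b

def pearson_per_channel(pred, true):
    """Pearson correlation between predicted and true trajectories, per channel."""
    pred_c = pred - pred.mean(axis=0)
    true_c = true - true.mean(axis=0)
    num = (pred_c * true_c).sum(axis=0)
    den = np.sqrt((pred_c ** 2).sum(axis=0) * (true_c ** 2).sum(axis=0))
    return num / den

rng = np.random.default_rng(0)

# Synthetic stand-ins (assumptions): frame-level features and articulator
# trajectories that depend linearly on them, plus noise.
n_frames, feat_dim, n_channels = 2000, 64, 12
features = rng.normal(size=(n_frames, feat_dim))
hidden_map = rng.normal(size=(feat_dim, n_channels))
trajectories = features @ hidden_map + 0.5 * rng.normal(size=(n_frames, n_channels))

# Fit on the first 1500 frames, evaluate on the remaining 500.
train, test = slice(0, 1500), slice(1500, None)
W, b = fit_linear_probe(features[train], trajectories[train])
pred = features[test] @ W + b
correlations = pearson_per_channel(pred, trajectories[test])
```

With real data, `features` would come from one layer of an SSL model and `trajectories` from time-aligned articulatory recordings. The correlation on held-out frames then indicates how much articulatory information a simple readout can recover from that layer.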
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and dialogue systems
