Self-Supervised Models of Speech Infer Universal Articulatory Kinematics

Cheol Jun Cho; Abdelrahman Mohamed; Alan W Black; Gopala K. Anumanchipalli

arXiv:2310.10788·eess.AS·January 17, 2024·1 cites

Open Access

TL;DR

This paper demonstrates that self-supervised speech models inherently encode articulatory dynamics, which are transferable across languages, speakers, and dialects, revealing their potential for universal, interpretable speech modeling.

Contribution

It identifies the inference of articulatory kinematics as a fundamental property of SSL speech models and shows that this ability transfers across diverse languages and speakers.

Findings

01

SSL models encode articulatory kinematics.

02

Articulatory representations largely overlap across training languages.

03

Affine transformations enable cross-speaker articulatory inversion.
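
As an illustration of the third finding, the following is a minimal sketch of fitting an affine map between two speakers' articulatory spaces by least squares. It is not the authors' implementation: the data are synthetic, the choice of 12 articulatory channels is an assumption, and the helper names `fit_affine` and `apply_affine` are hypothetical.

```python
import numpy as np


def fit_affine(src, tgt):
    """Least-squares fit of tgt ≈ src @ W + b (hypothetical helper)."""
    # Append a column of ones so the bias b is estimated jointly with W.
    src_aug = np.hstack([src, np.ones((src.shape[0], 1))])
    params, *_ = np.linalg.lstsq(src_aug, tgt, rcond=None)
    return params[:-1], params[-1]


def apply_affine(src, W, b):
    """Map source-speaker trajectories into the target speaker's space."""
    return src @ W + b


# Synthetic stand-in data: speaker B's trajectories are an affine function
# of speaker A's, plus a little noise. The 12 channels are an assumption.
rng = np.random.default_rng(0)
n_frames, n_channels = 500, 12
speaker_a = rng.normal(size=(n_frames, n_channels))
true_W = rng.normal(size=(n_channels, n_channels))
true_b = rng.normal(size=n_channels)
noise = 0.01 * rng.normal(size=(n_frames, n_channels))
speaker_b = speaker_a @ true_W + true_b + noise

W, b = fit_affine(speaker_a, speaker_b)
mapped = apply_affine(speaker_a, W, b)
max_error = np.abs(mapped - speaker_b).max()
```

Because the map is affine, it has only a weight matrix and a bias to estimate, which is why a small amount of paired data from a new speaker can be enough to fit it in a setting like this one.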

Abstract

Self-Supervised Learning (SSL) based models of speech have shown remarkable performance on a range of downstream tasks. These state-of-the-art models have remained black boxes, but many recent studies have begun "probing" models like HuBERT to correlate their internal representations with different aspects of speech. In this paper, we show "inference of articulatory kinematics" as a fundamental property of SSL models, i.e., the ability of these models to transform acoustics into the causal articulatory dynamics underlying the speech signal. We also show that this abstraction largely overlaps across the languages of the data used to train the model, with a preference for languages with similar phonological systems. Furthermore, we show that with simple affine transformations, Acoustic-to-Articulatory inversion (AAI) is transferable across speakers, even across genders, languages, and…
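
The "probing" idea mentioned in the abstract can be sketched as a linear regression from frame-level model features to articulatory trajectories, scored by per-channel Pearson correlation on held-out frames. This is only an illustrative sketch under stated assumptions, not the paper's protocol: the features and trajectories are synthetic, the dimensions (64 features, 12 channels) and the ridge penalty are arbitrary choices, and `fit_ridge_probe` and `pearson_per_channel` are hypothetical helpers.

```python
import numpy as np


def fit_ridge_probe(feats, targets, alpha=1.0):
    """Closed-form ridge regression probe: targets ≈ feats @ W + b."""
    mean_x = feats.mean(axis=0)
    mean_y = targets.mean(axis=0)
    xc = feats - mean_x
    yc = targets - mean_y
    # Regularized normal equations on centered data.
    gram = xc.T @ xc + alpha * np.eye(xc.shape[1])
    W = np.linalg.solve(gram, xc.T @ yc)
    b = mean_y - mean_x @ W
    return W, b


def pearson_per_channel(pred, ref):
    """Pearson correlation between prediction and reference, per channel."""
    pc = pred - pred.mean(axis=0)
    rc = ref - ref.mean(axis=0)
    denom = np.linalg.norm(pc, axis=0) * np.linalg.norm(rc, axis=0)
    return (pc * rc).sum(axis=0) / denom


# Synthetic stand-ins: in practice `feats` would be hidden states from an
# SSL model and `artic` measured articulator positions (both assumptions).
rng = np.random.default_rng(0)
n_train, n_test, feat_dim, n_channels = 2000, 500, 64, 12
n_total = n_train + n_test
hidden_W = rng.normal(size=(feat_dim, n_channels))
feats = rng.normal(size=(n_total, feat_dim))
artic = feats @ hidden_W + 0.5 * rng.normal(size=(n_total, n_channels))

# Fit on the training frames, evaluate on the held-out frames.
W, b = fit_ridge_probe(feats[:n_train], artic[:n_train])
pred = feats[n_train:] @ W + b
corr = pearson_per_channel(pred, artic[n_train:])
```

A high held-out correlation from a purely linear probe would suggest that the articulatory information is linearly accessible in the features rather than introduced by a powerful decoder.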

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and dialogue systems