Training Articulatory Inversion Models for Interspeaker Consistency
Charles McGhee, Mark J.F. Gales, Kate M. Knill

TL;DR
This paper investigates whether self-supervised learning (SSL) models adapted for acoustic-to-articulatory inversion produce articulatory targets that are consistent across speakers in English and Russian, and introduces a novel evaluation method and a training approach aimed at that consistency.
Contribution
It introduces a training approach for improving interspeaker consistency in articulatory inversion models, along with an evaluation method that measures this consistency using articulatory targets extracted from minimal pair sets.
Findings
Models trained with the proposed method show improved interspeaker consistency.
The evaluation method effectively measures articulatory target similarity across speakers.
Results are demonstrated on English and Russian datasets.
Abstract
Acoustic-to-Articulatory Inversion (AAI) attempts to model the inverse mapping from speech to articulation. Exact articulatory prediction from speech alone may be impossible, as speakers can choose different forms of articulation seemingly without reference to their vocal tract structure. However, once a speaker has selected an articulatory form, their productions vary minimally. Recent works in AAI have proposed adapting Self-Supervised Learning (SSL) models to single-speaker datasets, claiming that these single-speaker models provide a universal articulatory template. In this paper, we investigate whether SSL-adapted models trained on single and multi-speaker data produce articulatory targets which are consistent across speaker identities for English and Russian. We do this through the use of a novel evaluation method which extracts articulatory targets using minimal pair sets. We…
Taxonomy
Topics: Phonetics and Phonology Research · Speech Recognition and Synthesis · Speech and Audio Processing
