Speaker- and Text-Independent Estimation of Articulatory Movements and Phoneme Alignments from Speech
Tobias Weise, Philipp Klumpp, Kubilay Can Demir, Paula Andrea Pérez-Toro, Maria Schuster, Elmar Noeth, Bjoern Heismann, Andreas Maier, Seung Hee Yang

TL;DR
This paper presents a new joint approach for speaker- and text-independent estimation of articulatory movements and phoneme alignments directly from raw speech, combining acoustic-to-articulatory inversion and phoneme-to-articulatory estimation.
Contribution
It introduces the acoustic phoneme-to-articulatory speech inversion (APTAI) task and compares two novel methods for speaker- and text-independent articulatory and phoneme prediction from speech.
Findings
Achieved 0.73 mean correlation in articulatory inversion.
Reached up to 87% frame overlap with state-of-the-art aligners.
Demonstrated effectiveness of joint inversion and alignment from raw speech.
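The two headline numbers above are a mean correlation and a frame overlap. This summary does not give the paper's exact metric definitions, so the following is only a minimal sketch under two assumptions: that the correlation is the Pearson coefficient computed per articulatory channel and then averaged, and that frame overlap is the fraction of frames on which two aligners assign the same phoneme label. The function names `mean_pearson_correlation` and `frame_overlap` are hypothetical and not taken from the paper.

```python
import numpy as np


def mean_pearson_correlation(pred, target):
    """Average Pearson correlation over articulatory channels.

    pred, target: arrays of shape (frames, channels).
    Assumption: the correlation is computed per channel, then averaged.
    """
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    if pred.shape != target.shape:
        raise ValueError("pred and target must have the same shape")
    # Center each channel over time.
    pred_c = pred - pred.mean(axis=0)
    target_c = target - target.mean(axis=0)
    # Per-channel covariance divided by the product of standard deviations.
    num = (pred_c * target_c).sum(axis=0)
    den = np.sqrt((pred_c ** 2).sum(axis=0) * (target_c ** 2).sum(axis=0))
    return float(np.mean(num / den))


def frame_overlap(labels_a, labels_b):
    """Fraction of frames on which two frame-level label sequences agree.

    Assumption: both sequences use the same frame rate and length.
    """
    a = np.asarray(labels_a)
    b = np.asarray(labels_b)
    if a.shape != b.shape:
        raise ValueError("label sequences must have the same length")
    return float(np.mean(a == b))


# Toy usage with synthetic trajectories (not data from the paper).
t = np.linspace(0.0, 1.0, 50)
target = np.stack([np.sin(2 * np.pi * t), np.cos(2 * np.pi * t)], axis=1)
pred = 2.0 * target + 1.0  # correlation ignores scale and offset
corr = mean_pearson_correlation(pred, target)
overlap = frame_overlap(["a", "a", "b", "b"], ["a", "b", "b", "b"])
```

Because Pearson correlation is invariant to scale and offset, a high score indicates that the predicted trajectories follow the shape of the reference movements, not that they match in absolute position.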
Abstract
This paper introduces a novel combination of two tasks, previously treated separately: acoustic-to-articulatory speech inversion (AAI) and phoneme-to-articulatory (PTA) motion estimation. We refer to this joint task as acoustic phoneme-to-articulatory speech inversion (APTAI) and explore two different approaches, both working speaker- and text-independently during inference. We use a multi-task learning setup, with the end-to-end goal of taking raw speech as input and estimating the corresponding articulatory movements, phoneme sequence, and phoneme alignment. While both proposed approaches share these same requirements, they differ in their way of achieving phoneme-related predictions: one is based on frame classification, the other on a two-staged training procedure and forced alignment. We reach competitive performance of 0.73 mean correlation for the AAI task and achieve up to…
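The frame-classification approach mentioned in the abstract implies that a phoneme alignment can be read off from per-frame phoneme predictions. As a rough illustration only, the sketch below collapses runs of identical frame labels into timed segments. The helper name `frames_to_segments` and the 20 ms frame hop are assumptions made for this example and do not come from the paper.

```python
from itertools import groupby


def frames_to_segments(frame_labels, hop_seconds=0.02):
    """Collapse consecutive identical frame labels into timed segments.

    frame_labels: sequence of per-frame phoneme labels.
    hop_seconds: time step between frames (assumed value, not from the paper).
    Returns a list of (label, start_time, end_time) tuples.
    """
    segments = []
    start = 0  # index of the first frame of the current run
    for label, run in groupby(frame_labels):
        length = sum(1 for _ in run)
        segments.append((label, start * hop_seconds, (start + length) * hop_seconds))
        start += length
    return segments


# Toy usage with made-up frame predictions.
segments = frames_to_segments(["sil", "sil", "h", "h", "h", "ay"])
```

The phoneme sequence is then the list of segment labels in order, and the segment boundaries give the alignment.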
Taxonomy
Topics: Phonetics and Phonology Research · Speech and Audio Processing · Speech Recognition and Synthesis
