Speaker- and Text-Independent Estimation of Articulatory Movements and Phoneme Alignments from Speech

Tobias Weise; Philipp Klumpp; Kubilay Can Demir; Paula Andrea Pérez-Toro; Maria Schuster; Elmar Noeth; Bjoern Heismann; Andreas Maier; Seung Hee Yang

arXiv:2407.03132·cs.SD·July 4, 2024

Open Access · 1 Repo

TL;DR

This paper presents a new joint approach for speaker- and text-independent estimation of articulatory movements and phoneme alignments directly from raw speech, combining acoustic-to-articulatory inversion and phoneme-to-articulatory estimation.

Contribution

It introduces the acoustic phoneme-to-articulatory speech inversion (APTAI) task and compares two novel methods for speaker- and text-independent articulatory and phoneme prediction from speech.

Findings

01. Achieved a mean correlation of 0.73 on the acoustic-to-articulatory inversion (AAI) task.

02. Reached up to 87% frame overlap with state-of-the-art aligners.

03. Demonstrated that joint inversion and alignment directly from raw speech is effective.
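The two headline numbers above are a correlation score and a frame-overlap score. This page does not spell out the paper's exact evaluation protocol, so the following is only a minimal sketch under stated assumptions: "mean correlation" is taken to be the Pearson correlation averaged over articulator channels, and "frame overlap" is taken to be the fraction of frames on which two alignments carry the same phoneme label. The function names are hypothetical and are not taken from the authors' repository.

```python
from math import sqrt
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two equally long trajectories."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("trajectories must have equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant trajectory")
    return cov / sqrt(var_x * var_y)


def mean_correlation(
    predicted: Sequence[Sequence[float]],
    reference: Sequence[Sequence[float]],
) -> float:
    """Average the per-channel Pearson correlation.

    Assumption: each inner sequence is one articulator channel over time.
    """
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("need the same non-zero number of channels")
    scores = [pearson(p, r) for p, r in zip(predicted, reference)]
    return sum(scores) / len(scores)


def frame_overlap(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Fraction of frames where two frame-level alignments agree."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("alignments must have equal non-zero length")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)
```

Under these assumptions, a frame overlap of 0.87 would mean that 87 out of every 100 frames receive the same phoneme label from both alignments.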

Abstract

This paper introduces a novel combination of two tasks, previously treated separately: acoustic-to-articulatory speech inversion (AAI) and phoneme-to-articulatory (PTA) motion estimation. We refer to this joint task as acoustic phoneme-to-articulatory speech inversion (APTAI) and explore two different approaches, both working speaker- and text-independently during inference. We use a multi-task learning setup, with the end-to-end goal of taking raw speech as input and estimating the corresponding articulatory movements, phoneme sequence, and phoneme alignment. While both proposed approaches share these same requirements, they differ in their way of achieving phoneme-related predictions: one is based on frame classification, the other on a two-staged training procedure and forced alignment. We reach competitive performance of 0.73 mean correlation for the AAI task and achieve up to…
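To make the frame-classification route more concrete, here is a small sketch of how per-frame phoneme labels could be collapsed into a phoneme sequence and an alignment. This is an illustration only, not the authors' implementation: the silence label `"sil"`, the 20 ms frame hop, and all function names are assumptions introduced for this example.

```python
from itertools import groupby
from typing import List, Sequence, Tuple

Segment = Tuple[str, int, int]  # (label, start_frame, end_frame_exclusive)


def frames_to_segments(frame_labels: Sequence[str]) -> List[Segment]:
    """Merge runs of identical consecutive frame labels into segments."""
    segments: List[Segment] = []
    start = 0
    for label, run in groupby(frame_labels):
        length = sum(1 for _ in run)
        segments.append((label, start, start + length))
        start += length
    return segments


def phoneme_sequence(segments: Sequence[Segment], silence: str = "sil") -> List[str]:
    """Drop silence segments to obtain the phoneme sequence (assumed label)."""
    return [label for label, _, _ in segments if label != silence]


def segment_times(
    segments: Sequence[Segment], hop_s: float = 0.02
) -> List[Tuple[str, float, float]]:
    """Convert frame indices to seconds, assuming a fixed hop (hypothetical 20 ms)."""
    return [(label, s * hop_s, e * hop_s) for label, s, e in segments]


if __name__ == "__main__":
    frames = ["sil", "sil", "h", "e", "e", "e", "sil"]
    segs = frames_to_segments(frames)
    print(segs)
    print(phoneme_sequence(segs))
    print(segment_times(segs))
```

A run-length collapse like this yields both outputs at once: the order of the segments gives the phoneme sequence, and their boundaries give the alignment. One limitation is that two genuinely adjacent instances of the same phoneme would be merged into a single segment.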

Code & Models

Repositories

tobwei/aptai
PyTorch · Official

Taxonomy

Topics: Phonetics and Phonology Research · Speech and Audio Processing · Speech Recognition and Synthesis