Masked Autoencoders Are Articulatory Learners
Ahmed Adel Attia, Carol Espy-Wilson

TL;DR
This paper introduces a deep learning method using Masked Autoencoders to accurately reconstruct mistracked articulatory recordings in speech datasets, significantly improving data usability for speech research.
Contribution
The study presents a novel application of Masked Autoencoders to recover corrupted articulatory data, enabling the use of previously unusable recordings in speech analysis.
Findings
Successfully reconstructed articulatory trajectories for 41 of the 47 speakers in the XRMB dataset.
Recovered 3.28 hours of previously unusable data.
Reconstructed trajectories closely match the ground-truth articulatory trajectories.
Abstract
Articulatory recordings track the positions and motion of different articulators along the vocal tract and are widely used to study speech production and to develop speech technologies such as articulatory-based speech synthesizers and speech inversion systems. The University of Wisconsin X-Ray Microbeam (XRMB) dataset is one of several datasets that provide articulatory recordings synced with audio recordings. The XRMB articulatory recordings employ pellets placed on a number of articulators, which can be tracked by the microbeam. However, a significant portion of the articulatory recordings are mistracked and have so far been unusable. In this work, we present a deep-learning-based approach using Masked Autoencoders to accurately reconstruct the mistracked articulatory recordings for 41 out of 47 speakers of the XRMB dataset. Our model is able to reconstruct articulatory trajectories…
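The masking idea behind this kind of reconstruction can be illustrated with a small sketch. This is not the authors' model or data pipeline: the synthetic two-channel trajectory, the 30% mask ratio, and the helper names (mask_frames, interpolate_baseline, masked_mse) are all assumptions made for illustration, and plain linear interpolation stands in for the learned Masked Autoencoder. The sketch only shows how frames can be hidden, filled in, and scored on the hidden positions alone.

```python
import numpy as np


def mask_frames(traj, mask_ratio, rng):
    """Hide a random subset of time frames.

    Returns a copy with hidden frames set to NaN and a boolean mask
    (True = hidden), mimicking frames that were mistracked.
    """
    n_frames = traj.shape[0]
    n_masked = int(round(mask_ratio * n_frames))
    masked_idx = rng.choice(n_frames, size=n_masked, replace=False)
    mask = np.zeros(n_frames, dtype=bool)
    mask[masked_idx] = True
    masked = traj.copy()
    masked[mask] = np.nan
    return masked, mask


def interpolate_baseline(masked):
    """Stand-in for a learned model (assumption): fill NaN frames
    by per-channel linear interpolation over the visible frames."""
    filled = masked.copy()
    t = np.arange(masked.shape[0])
    for c in range(masked.shape[1]):
        known = ~np.isnan(masked[:, c])
        filled[:, c] = np.interp(t, t[known], masked[known, c])
    return filled


def masked_mse(pred, target, mask):
    """Mean squared error computed only on the hidden frames."""
    return float(np.mean((pred[mask] - target[mask]) ** 2))


rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 200)
# Synthetic trajectory: hypothetical x/y coordinates of one pellet.
traj = np.stack([np.sin(2 * np.pi * 3 * t), np.cos(2 * np.pi * 2 * t)], axis=1)

masked, mask = mask_frames(traj, mask_ratio=0.3, rng=rng)
recon = interpolate_baseline(masked)
error = masked_mse(recon, traj, mask)
print(f"hidden frames: {mask.sum()}, masked MSE: {error:.6f}")
```

Scoring only the hidden frames is the usual choice in masked-reconstruction setups, since the visible frames are passed through unchanged and would otherwise dilute the error.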
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing
