Towards disentangling the contributions of articulation and acoustics in multimodal phoneme recognition

Sean Foley; Hong Nguyen; Jihwan Lee; Sudarsana Reddy Kadiri; Dani Byrd; Louis Goldstein; Shrikanth Narayanan

arXiv:2505.24059 · cs.LG · June 2, 2025

TL;DR

This study develops unimodal and multimodal phoneme recognition models using a single-speaker MRI corpus to better understand the distinct contributions of acoustics and articulation, revealing both similarities and differences in their encoding.

Contribution

The paper uses a long-form single-speaker MRI corpus, rather than the multi-speaker corpora of earlier work, to disentangle and interpret the roles of acoustics and articulation in phoneme recognition.

Findings

01. Audio and multimodal models perform similarly across phonetic manner classes.

02. Audio and multimodal models diverge on places of articulation, indicating that the modalities contribute differently.

03. Latent space analysis shows similar encoding of the phonetic space across audio and multimodal models.
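
The manner and place comparisons in the findings can be illustrated with a per-class accuracy breakdown. The following is a minimal sketch, not the authors' evaluation code: it assumes reference and predicted phoneme labels are already aligned one-to-one, and the `MANNER` and `PLACE` maps, the ARPAbet-style symbols, and the function name `accuracy_by_class` are all hypothetical choices made for illustration.

```python
from collections import defaultdict

# Hypothetical, partial phoneme-to-class maps (ARPAbet-style symbols).
# These coarse groupings are an illustrative assumption, not the paper's inventory.
MANNER = {
    "P": "stop", "B": "stop", "T": "stop", "D": "stop", "K": "stop", "G": "stop",
    "S": "fricative", "Z": "fricative", "F": "fricative", "V": "fricative",
    "M": "nasal", "N": "nasal",
}
PLACE = {
    "P": "labial", "B": "labial", "M": "labial", "F": "labial", "V": "labial",
    "T": "coronal", "D": "coronal", "S": "coronal", "Z": "coronal", "N": "coronal",
    "K": "dorsal", "G": "dorsal",
}


def accuracy_by_class(refs, hyps, phoneme_to_class):
    """Return {class: accuracy}, grouping each aligned pair by the class of the reference phoneme."""
    if len(refs) != len(hyps):
        raise ValueError("refs and hyps must be aligned and of equal length")
    correct = defaultdict(int)
    total = defaultdict(int)
    for ref, hyp in zip(refs, hyps):
        cls = phoneme_to_class.get(ref)
        if cls is None:
            # Skip phonemes that the (partial) map does not cover.
            continue
        total[cls] += 1
        correct[cls] += int(ref == hyp)
    return {cls: correct[cls] / total[cls] for cls in total}


# Toy labels, invented for this example only.
refs = ["P", "T", "K", "S", "M", "N"]
hyps = ["P", "K", "K", "S", "M", "M"]
by_manner = accuracy_by_class(refs, hyps, MANNER)
by_place = accuracy_by_class(refs, hyps, PLACE)
```

Running the same function over the outputs of an audio-only model and a multimodal model, once with the manner map and once with the place map, gives the kind of side-by-side comparison the findings describe. Note that errors here are counted against the reference phoneme's class, so a confusion within the same manner class still counts as an error.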

Abstract

Although many previous studies have carried out multimodal learning with real-time MRI data that captures the audio-visual kinematics of the vocal tract during speech, these studies have been limited by their reliance on multi-speaker corpora. This prevents such models from learning a detailed relationship between acoustics and articulation due to considerable cross-speaker variability. In this study, we develop unimodal audio and video models as well as multimodal models for phoneme recognition using a long-form single-speaker MRI corpus, with the goal of disentangling and interpreting the contributions of each modality. Audio and multimodal models show similar performance on different phonetic manner classes but diverge on places of articulation. Interpretation of the models' latent space shows similar encoding of the phonetic space across audio and multimodal models, while the…

Taxonomy

Topics: Speech Recognition and Synthesis · Phonetics and Phonology Research · Speech and Audio Processing

Methods: Softmax · Attention Is All You Need