Audio-Vision Contrastive Learning for Phonological Class Recognition
Daiqi Liu, Tom\'as Arias-Vergara, Jana Hutter, Andreas Maier, Paula Andrea P\'erez-Toro

TL;DR
This paper introduces a contrastive learning framework combining rtMRI and speech signals for classifying phonological features, achieving state-of-the-art results in articulatory analysis.
Contribution
It presents a novel multimodal contrastive learning approach that enhances phonological class recognition using real-time MRI and audio data.
Findings
Contrastive learning improves classification accuracy.
State-of-the-art F1-score of 0.81 achieved.
Multimodal fusion outperforms unimodal baselines.
Abstract
Accurate classification of articulatory-phonological features plays a vital role in understanding human speech production and developing robust speech technologies, particularly in clinical contexts where targeted phonemic analysis and therapy can improve disease diagnosis accuracy and personalized rehabilitation. In this work, we propose a multimodal deep learning framework that combines real-time magnetic resonance imaging (rtMRI) and speech signals to classify three key articulatory dimensions: manner of articulation, place of articulation, and voicing. We perform classification on 15 phonological classes derived from the aforementioned articulatory dimensions and evaluate the system with four audio/vision configurations: unimodal rtMRI, unimodal audio signals, multimodal middle fusion, and contrastive learning-based audio-vision fusion. Experimental results on the USC-TIMIT dataset…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsSpeech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
