Audio-Vision Contrastive Learning for Phonological Class Recognition

Daiqi Liu; Tomás Arias-Vergara; Jana Hutter; Andreas Maier; Paula Andrea Pérez-Toro

arXiv:2507.17682·cs.SD·July 24, 2025

Open Access

TL;DR

This paper introduces a contrastive learning framework that combines real-time MRI (rtMRI) and speech signals to classify phonological features, reporting a state-of-the-art F1-score of 0.81.

Contribution

The paper presents a multimodal contrastive learning approach that improves phonological class recognition using real-time MRI and audio data.

Findings

01. Contrastive learning improves classification accuracy.

02. A state-of-the-art F1-score of 0.81 is achieved.

03. Multimodal fusion outperforms unimodal baselines.

Abstract

Accurate classification of articulatory-phonological features plays a vital role in understanding human speech production and developing robust speech technologies, particularly in clinical contexts where targeted phonemic analysis and therapy can improve disease diagnosis accuracy and personalized rehabilitation. In this work, we propose a multimodal deep learning framework that combines real-time magnetic resonance imaging (rtMRI) and speech signals to classify three key articulatory dimensions: manner of articulation, place of articulation, and voicing. We perform classification on 15 phonological classes derived from the aforementioned articulatory dimensions and evaluate the system with four audio/vision configurations: unimodal rtMRI, unimodal audio signals, multimodal middle fusion, and contrastive learning-based audio-vision fusion. Experimental results on the USC-TIMIT dataset…
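The contrastive audio-vision configuration mentioned in the abstract can be illustrated with a generic symmetric InfoNCE objective, which pulls together the audio and rtMRI embeddings of the same segment and pushes apart mismatched pairs. The sketch below is an assumption for illustration only: the abstract does not state the exact loss, encoders, or hyperparameters, so the function names, the temperature of 0.07, and the embedding shapes are all hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)


def log_softmax(logits):
    """Numerically stable row-wise log-softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def symmetric_contrastive_loss(audio_emb, video_emb, temperature=0.07):
    """Generic symmetric InfoNCE loss over a batch of paired embeddings.

    Row i of audio_emb and row i of video_emb are assumed to come from the
    same speech segment (hypothetical setup, not confirmed by the paper).
    """
    a = l2_normalize(audio_emb)
    v = l2_normalize(video_emb)
    logits = a @ v.T / temperature           # (batch, batch) similarity matrix
    idx = np.arange(len(a))                  # matching pairs lie on the diagonal
    loss_a2v = -log_softmax(logits)[idx, idx].mean()    # audio -> rtMRI direction
    loss_v2a = -log_softmax(logits.T)[idx, idx].mean()  # rtMRI -> audio direction
    return 0.5 * (loss_a2v + loss_v2a)


# Toy usage with random stand-ins for encoder outputs (batch of 8, 16 dimensions).
rng = np.random.default_rng(0)
audio = rng.normal(size=(8, 16))
video = audio + 0.01 * rng.normal(size=(8, 16))  # nearly aligned pairs
print(symmetric_contrastive_loss(audio, video))
```

In a setup like this, the loss is small when paired embeddings agree and grows when the pairing is scrambled; a classifier for the phonological classes would then be trained on top of the aligned embeddings.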

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing