Speech-Guided Multimodal Learning for Vocal Tract Segmentation in Real-Time MRI

Daiqi Liu; Lukas Mulzer; Md Hasan; Nyvenn de Castro; Fangxu Xing; Xingjian Kang; Chengze Ye; Siyuan Mei; Yipeng Sun; Tomás Arias-Vergara; Jana Hutter; Jonghye Woo; Andreas Maier; Paula Andrea Pérez-Toro

arXiv:2605.18466 · cs.CV · May 19, 2026

TL;DR

This paper introduces a multimodal learning framework that uses acoustic and phonological supervision during training to improve vocal tract segmentation in real-time MRI, while requiring only the image, and no audio, at inference.

Contribution

The proposed three-stage framework effectively leverages multimodal supervision to enhance vocal tract segmentation in real-time MRI, even when only imaging data is available during inference.

Findings

01  Outperforms existing unimodal and multimodal methods on benchmark datasets.

02  Demonstrates transfer of multimodal knowledge into single-modality inference.

03  Requires only the rtMRI image at inference, so it remains deployable when audio is unavailable.

Abstract

Segmenting vocal tract articulators in real-time MRI (rtMRI) is a challenging dynamic image segmentation problem characterized by low contrast, rapid motion, and limited spatial resolution. However, while rtMRI acquisitions may provide synchronized acoustic signals, existing methods discard this information, and the few multimodal approaches that incorporate audio cannot be deployed when audio is unavailable. We propose a three-stage framework that leverages acoustic and phonological supervision during training while requiring only the rtMRI image at inference: phonological representations are converted into spatial bounding-box priors for articulator localization, visual and acoustic encoders are aligned via dual-level cross-modal contrastive pretraining, and the learned representations are fused through a cross-attention decoder, effectively transferring multimodal knowledge into a…
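The cross-modal contrastive pretraining step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the loss, the two levels of the "dual-level" scheme, or the encoders, so everything below is an assumption: it shows a generic single-level symmetric InfoNCE loss over paired image and audio embeddings, written in plain Python. The function names `l2_normalize` and `symmetric_info_nce` and the temperature value are hypothetical and not taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def symmetric_info_nce(image_emb, audio_emb, temperature=0.07):
    """Generic symmetric InfoNCE loss (an assumed stand-in, not the paper's loss).

    image_emb[i] and audio_emb[i] are treated as a matched pair; every other
    combination in the batch is treated as a negative.
    """
    img = [l2_normalize(v) for v in image_emb]
    aud = [l2_normalize(v) for v in audio_emb]
    n = len(img)

    # Temperature-scaled cosine similarity between every image and every audio clip.
    logits = [
        [sum(a * b for a, b in zip(img[i], aud[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy_on_diagonal(rows):
        # Mean of -log softmax(row)[i], computed with a stable log-sum-exp.
        total = 0.0
        for i, row in enumerate(rows):
            peak = max(row)
            log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    # Image-to-audio uses the rows, audio-to-image uses the columns.
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_on_diagonal(logits) + cross_entropy_on_diagonal(logits_t))


# Toy check: matched pairs should score a lower loss than mismatched ones.
matched = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
mismatched = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy setup the loss is small when each image embedding lies closest to its own audio embedding and large when the pairing is swapped. That is the property a contrastive objective relies on to pull the two encoders into a shared space, after which the audio branch can be dropped at inference.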
