TL;DR
SIREM is a speech-informed MRI reconstruction framework that uses synchronized speech audio as a cross-modal prior to improve the speed and quality of real-time vocal-tract imaging.
Contribution
It introduces a multimodal reconstruction method combining audio-driven prediction and MRI data, with a learnable sampling profile for enhanced speed and accuracy.
Findings
SIREM outperforms standard baselines in reconstruction quality.
It enables faster MRI reconstruction while maintaining plausible anatomy.
The method establishes a new benchmark for speech-informed rtMRI.
Abstract
Real-time magnetic resonance imaging (rtMRI) of speech production enables non-invasive visualization of dynamic vocal-tract motion and is valuable for speech science and clinical assessment. However, rtMRI is fundamentally constrained by trade-offs among spatial resolution, temporal resolution, and acquisition speed, often leading to undersampled k-space measurements and degraded reconstructions. We propose SIREM, a speech-informed MRI reconstruction framework that uses synchronized speech as a cross-modal prior. The central idea is that vocal-tract configurations during speech are correlated with the produced acoustics, making part of the image content predictable from audio. SIREM models each frame as a fusion of an audio-driven component and an MRI-driven component through a spatial weighting map. The audio branch predicts articulator-related structure from speech, while the MRI…
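The fusion idea in the abstract, where each frame combines an audio-driven component and an MRI-driven component through a spatial weighting map, can be illustrated with a minimal NumPy sketch. The abstract does not give the exact fusion rule, so the per-pixel convex combination below is an assumption, and the names `fuse_frame`, `zero_filled_recon`, `audio_img`, `mri_img`, and `weight_map` are hypothetical. The zero-filled inverse FFT is only a stand-in for the MRI branch and is not the paper's reconstruction network.

```python
import numpy as np

def zero_filled_recon(image, mask):
    """Stand-in for the MRI branch (assumption, not the paper's model):
    simulate undersampled k-space by masking the 2D FFT of `image`,
    then reconstruct with a zero-filled inverse FFT."""
    kspace = np.fft.fft2(image)
    return np.abs(np.fft.ifft2(kspace * mask))

def fuse_frame(audio_img, mri_img, weight_map):
    """Assumed fusion rule: a per-pixel convex combination.
    weight_map = 1 trusts the audio-driven prediction at that pixel,
    weight_map = 0 trusts the MRI-driven reconstruction."""
    w = np.clip(weight_map, 0.0, 1.0)
    return w * audio_img + (1.0 - w) * mri_img

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    truth = rng.random((64, 64))             # toy "ground-truth" frame
    mask = rng.random((64, 64)) < 0.3        # keep ~30% of k-space samples
    mri_img = zero_filled_recon(truth, mask)
    audio_img = truth + 0.05 * rng.standard_normal(truth.shape)  # toy audio prediction
    weight_map = np.full(truth.shape, 0.5)   # hypothetical uniform weighting
    fused = fuse_frame(audio_img, mri_img, weight_map)
    print(fused.shape)
```

In the framework described above the weighting map would presumably be spatially varying, so that articulator regions predictable from speech lean on the audio branch; the uniform map here is only for illustration.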
