Bone-conduction Guided Multimodal Speech Enhancement with Conditional Diffusion Models
Sina Khanagha, Bunlong Lay, Timo Gerkmann

TL;DR
This paper presents a multimodal speech enhancement framework that fuses bone-conduction sensor signals with air-conducted microphone signals in a conditional diffusion model, outperforming prior multimodal techniques and a diffusion-based single-modal baseline in noisy environments.
Contribution
The paper introduces a multimodal approach that integrates bone-conduction sensors with air-conducted microphones via a conditional diffusion model for noise-robust speech enhancement.
Findings
Outperforms previously established multimodal techniques
Outperforms a strong diffusion-based single-modal baseline
Effective across a wide range of acoustic conditions
Abstract
Single-channel speech enhancement models face significant performance degradation in extremely noisy environments. While prior work has shown that complementary bone-conducted speech can guide enhancement, effective integration of this noise-immune modality remains a challenge. This paper introduces a novel multimodal speech enhancement framework that integrates bone-conduction sensors with air-conducted microphones using a conditional diffusion model. Our proposed model significantly outperforms previously established multimodal techniques and a powerful diffusion-based single-modal baseline across a wide range of acoustic conditions.
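The abstract does not say how the bone-conducted signal enters the diffusion model. As a rough illustration only, one common way to condition a score network is to concatenate the conditioning signals with the current diffusion state along the channel axis. The sketch below assumes complex spectrograms of equal shape and uses hypothetical names (`stack_real_imag`, `build_score_model_input`); it is not taken from the paper and may differ from the authors' design.

```python
import numpy as np


def stack_real_imag(spec):
    """Split a complex spectrogram (freq, time) into two real channels."""
    return np.stack([spec.real, spec.imag], axis=0)


def build_score_model_input(x_t, noisy_ac, bc):
    """Concatenate the diffusion state with both conditioning signals.

    Assumption (not from the paper): conditioning by channel concatenation.
    All inputs are complex spectrograms of shape (freq, time).
    Returns a real array of shape (6, freq, time).
    """
    if not (x_t.shape == noisy_ac.shape == bc.shape):
        raise ValueError("all spectrograms must share the same shape")
    return np.concatenate(
        [stack_real_imag(x_t), stack_real_imag(noisy_ac), stack_real_imag(bc)],
        axis=0,
    )


# Toy data standing in for real spectrograms.
rng = np.random.default_rng(0)
shape = (256, 128)


def random_spec():
    return rng.standard_normal(shape) + 1j * rng.standard_normal(shape)


x_t, noisy_ac, bc = random_spec(), random_spec(), random_spec()
model_input = build_score_model_input(x_t, noisy_ac, bc)
print(model_input.shape)
```

In this layout the first two channels carry the diffusion state, the next two the noisy air-conducted recording, and the last two the bone-conducted recording, so a network can read both modalities at every time-frequency bin.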
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Face recognition and analysis
