Bone-conduction Guided Multimodal Speech Enhancement with Conditional Diffusion Models

Sina Khanagha; Bunlong Lay; Timo Gerkmann

arXiv:2601.12354 · eess.AS · January 21, 2026

Open Access

TL;DR

This paper presents a multimodal speech enhancement framework that combines bone-conduction sensor and air-conducted microphone signals using a conditional diffusion model. It outperforms previously established multimodal techniques and a diffusion-based single-modal baseline across a wide range of acoustic conditions.

Contribution

The paper introduces a multimodal approach that integrates bone-conduction sensors with air-conducted microphones via a conditional diffusion model, targeting speech enhancement in extremely noisy environments where single-channel models degrade.

Findings

1. Outperforms previous multimodal techniques
2. Outperforms diffusion-based single-modal baseline
3. Effective in diverse acoustic conditions

Abstract

Single-channel speech enhancement models face significant performance degradation in extremely noisy environments. While prior work has shown that complementary bone-conducted speech can guide enhancement, effective integration of this noise-immune modality remains a challenge. This paper introduces a novel multimodal speech enhancement framework that integrates bone-conduction sensors with air-conducted microphones using a conditional diffusion model. Our proposed model significantly outperforms previously established multimodal techniques and a powerful diffusion-based single-modal baseline across a wide range of acoustic conditions.

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Face recognition and analysis