DBMIF: a deep balanced multimodal iterative fusion framework for air- and bone-conduction speech enhancement

Yilei Wu; Changyan Zheng; Xingyu Zhang; Yakun Zhang; Chengshi Zheng; Shuang Yang; Ye Yan; and Erwei Yin

arXiv:2603.02877·eess.AS·March 4, 2026·Appl. Intell.

Open Access

TL;DR

This paper introduces DBMIF, a multimodal speech enhancement framework that combines air- and bone-conduction signals through iterative fusion, improving speech quality and ASR performance in noisy environments.

Contribution

The paper presents a new three-branch deep fusion architecture with iterative cross-modal interaction and a balanced representation learning mechanism for robust speech enhancement.
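
To make the idea of gated, iterative cross-modal interaction concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the function names (`gated_fusion`, `iterative_fusion`), the single linear gate, the `step` parameter, and the update rule are all assumptions chosen for illustration, and the paper's actual modules are likely far richer.

```python
import numpy as np


def sigmoid(x):
    """Logistic function, used to squash gate logits into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def gated_fusion(ac, bc, w_gate, b_gate):
    """Blend AC and BC features with a per-feature gate (hypothetical form).

    ac, bc : arrays of shape (frames, dim)
    w_gate : array of shape (2 * dim, dim)
    b_gate : array of shape (dim,)
    """
    # The gate sees both modalities and decides, per feature, how much to
    # trust the air-conduction branch versus the bone-conduction branch.
    gate = sigmoid(np.concatenate([ac, bc], axis=-1) @ w_gate + b_gate)
    fused = gate * ac + (1.0 - gate) * bc
    return fused, gate


def iterative_fusion(ac, bc, w_gate, b_gate, n_iters=3, step=0.5):
    """Repeat gated fusion, feeding the fused result back to both branches.

    The feedback rule (move each branch part of the way toward the fused
    features) is an illustrative assumption, not the paper's update.
    """
    fused = 0.5 * (ac + bc)
    for _ in range(n_iters):
        fused, _ = gated_fusion(ac, bc, w_gate, b_gate)
        # Bidirectional exchange: both branches are refined by the fusion.
        ac = ac + step * (fused - ac)
        bc = bc + step * (fused - bc)
    return fused
```

With a strongly positive gate bias the output follows the AC branch, and with a strongly negative bias it follows the BC branch, which is the kind of adaptive weighting one might want as the SNR of the air-conduction microphone changes.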

Findings

01. DBMIF outperforms recent baselines in speech quality and intelligibility.

02. It reduces the character error rate (CER) in ASR by at least 2.5%.

03. It is robust across diverse noise types.

Abstract

The performance of conventional speech enhancement systems degrades sharply in extremely low signal-to-noise ratio (SNR) environments where air-conduction (AC) microphones are overwhelmed by ambient noise. Although bone-conduction (BC) sensors offer complementary, noise-tolerant information, existing fusion approaches struggle to maintain consistent performance across a wide range of SNR conditions. To address this limitation, we propose the Deep Balanced Multimodal Iterative Fusion Framework (DBMIF), a three-branch architecture designed to reconstruct high-fidelity speech through rigorous cross-modal interaction. Specifically, grounded in a multi-scale interactive encoder-decoder backbone, the framework orchestrates an iterative attention module and a cross-branch gated module to facilitate adaptive weighting and bidirectional exchange. To complement this dynamic interaction, a…

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Hearing Loss and Rehabilitation