Improving Dual-Microphone Speech Enhancement by Learning Cross-Channel Features with Multi-Head Attention
Xinmeng Xu, Rongzhi Gu, Yuexian Zou

TL;DR
This paper introduces a dual-microphone speech enhancement model that uses multi-head cross-attention to learn cross-channel features, and pairs it with a multi-task SNR estimator and a spectral gain to improve noise suppression.
Contribution
The paper proposes a new MHCA-CRN architecture that effectively learns cross-channel features and incorporates a multi-task SNR estimator to reduce speech distortion in dual-microphone speech enhancement.
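As a rough illustration of the cross-channel idea, the sketch below implements generic multi-head cross-attention in NumPy: queries come from one microphone's encoded features, keys and values from the other's. It is a minimal sketch of the standard scaled dot-product mechanism, not the authors' MHCA-CRN implementation. The function name, the random projection matrices, and the feature shapes are all assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_cross_attention(query_feats, context_feats,
                               w_q, w_k, w_v, w_o, num_heads):
    """Generic cross-attention (hypothetical helper, not from the paper).

    query_feats:   (T_q, D) features from one microphone channel
    context_feats: (T_c, D) features from the other channel
    w_q, w_k, w_v, w_o: (D, D) projection matrices
    """
    t_q, d_model = query_feats.shape
    d_head = d_model // num_heads

    def split_heads(x):
        # (T, D) -> (H, T, d_head)
        return x.reshape(x.shape[0], num_heads, d_head).transpose(1, 0, 2)

    q = split_heads(query_feats @ w_q)
    k = split_heads(context_feats @ w_k)
    v = split_heads(context_feats @ w_v)

    # Each query frame scores every frame of the other channel.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)   # (H, T_q, T_c)
    weights = softmax(scores, axis=-1)
    heads = weights @ v                                    # (H, T_q, d_head)

    # Merge heads back to (T_q, D) and apply the output projection.
    merged = heads.transpose(1, 0, 2).reshape(t_q, d_model)
    return merged @ w_o, weights

# Toy inputs: 6 frames, 8-dimensional features, 2 heads (all assumed values).
rng = np.random.default_rng(0)
T, D, H = 6, 8, 2
ch1 = rng.standard_normal((T, D))
ch2 = rng.standard_normal((T, D))
w_q, w_k, w_v, w_o = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(4))

out, attn = multi_head_cross_attention(ch1, ch2, w_q, w_k, w_v, w_o, H)
print(out.shape, attn.shape)
```

Swapping the roles of `ch1` and `ch2` gives attention in the opposite direction. How the paper combines the two directions is not stated in this summary.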
Findings
Outperforms several state-of-the-art models in speech enhancement tasks.
Effectively learns mutual relationships between spatial and spectral features.
Reduces residual noise and speech distortion.
Abstract
Hand-crafted spatial features, such as inter-channel intensity difference (IID) and inter-channel phase difference (IPD), play a fundamental role in recent deep learning based dual-microphone speech enhancement (DMSE) systems. However, learning the mutual relationship between artificially designed spatial and spectral features is hard in the end-to-end DMSE. In this work, a novel architecture for DMSE using a multi-head cross-attention based convolutional recurrent network (MHCA-CRN) is presented. The proposed MHCA-CRN model includes a channel-wise encoding structure for preserving intra-channel features and a multi-head cross-attention mechanism for fully exploiting cross-channel features. In addition, the proposed approach specifically formulates the decoder with an extra SNR estimator to estimate frame-level SNR under a multi-task learning framework, which is expected to avoid speech…
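For readers unfamiliar with the hand-crafted features the abstract mentions, the sketch below computes IID and IPD from a two-channel STFT using one common set of definitions: a log-magnitude ratio in dB for IID and the phase of the cross-spectrum for IPD. The paper may define them differently. The helper names, frame sizes, and the synthetic two-microphone signal are assumptions for illustration only.

```python
import numpy as np

def stft(x, frame_len=256, hop=128):
    # Simple Hann-windowed STFT; returns (n_frames, frame_len // 2 + 1).
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=-1)

def iid_ipd(y1, y2, eps=1e-8):
    # IID: level difference in dB between the channels (assumed definition).
    iid = 20.0 * np.log10((np.abs(y1) + eps) / (np.abs(y2) + eps))
    # IPD: phase of the cross-spectrum, wrapped to [-pi, pi].
    ipd = np.angle(y1 * np.conj(y2))
    return iid, ipd

# Synthetic example: the second microphone receives the same 440 Hz tone,
# attenuated by half and delayed by 4 samples (assumed values).
fs = 16000
t = np.arange(fs) / fs
src = np.sin(2 * np.pi * 440 * t)
mic1 = src
mic2 = 0.5 * np.roll(src, 4)

Y1, Y2 = stft(mic1), stft(mic2)
iid, ipd = iid_ipd(Y1, Y2)

tone_bin = int(round(440 * 256 / fs))  # frequency bin nearest the tone
print(iid[:, tone_bin].mean(), ipd[:, tone_bin].mean())
```

At the tone's frequency bin, the IID reflects the level ratio between the microphones and the IPD reflects the inter-microphone delay. Bins far from the tone carry little energy, so their values are not meaningful in this toy example.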
