U-Former: Improving Monaural Speech Enhancement with Multi-head Self and Cross Attention
Xinmeng Xu, Jianjun Hao

TL;DR
This paper introduces U-Former, a Transformer-based U-net architecture for monaural speech enhancement that effectively models long-term temporal and spectral dependencies using multi-head attention mechanisms, leading to improved speech quality.
Contribution
It proposes a novel U-Former model that integrates multi-head self- and cross-attention in a U-net structure for enhanced long-term context modeling in speech enhancement.
Findings
U-Former outperforms recent models in PESQ, STOI, and SSNR scores.
The multi-head attention mechanisms effectively capture long-term dependencies.
The model improves speech quality and intelligibility in noisy environments.
Abstract
For supervised speech enhancement, contextual information is important for accurate spectral mapping. However, commonly used deep neural networks (DNNs) are limited in capturing temporal contexts. To leverage long-term contexts for tracking a target speaker, this paper treats speech enhancement as a sequence-to-sequence mapping and proposes a novel monaural speech enhancement U-net structure based on Transformer, dubbed U-Former. The key idea is to model long-term correlations and dependencies, which are crucial for accurate noisy speech modeling, through multi-head attention mechanisms. For this purpose, U-Former incorporates multi-head attention mechanisms at two levels: 1) a multi-head self-attention module which calculates the attention map along both the time and frequency axes to generate time and frequency sub-attention maps for leveraging global interactions between encoder…
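The time and frequency sub-attention maps mentioned in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `axis_self_attention`, the `(T, F, C)` feature layout, the head count, and the omission of learned query/key/value projections are all assumptions made for illustration. The sketch only shows how scaled dot-product self-attention can be run separately along each axis of a time-frequency feature map.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def axis_self_attention(feats, axis, num_heads=2):
    """Multi-head self-attention along one axis of a (T, F, C) feature map.

    axis=0 attends over time frames (one sequence per frequency bin);
    axis=1 attends over frequency bins (one sequence per time frame).
    Assumption: queries, keys and values are the raw features split into
    heads; a real model would apply learned linear projections first.
    """
    assert axis in (0, 1)
    # Arrange as (batch, length, channels), where `length` is the attended axis.
    x = np.swapaxes(feats, 0, 1) if axis == 0 else feats
    B, L, C = x.shape
    assert C % num_heads == 0, "channels must divide evenly into heads"
    d = C // num_heads
    h = x.reshape(B, L, num_heads, d).transpose(0, 2, 1, 3)   # (B, H, L, d)
    scores = h @ h.transpose(0, 1, 3, 2) / np.sqrt(d)         # (B, H, L, L)
    attn = softmax(scores, axis=-1)                           # sub-attention map
    out = (attn @ h).transpose(0, 2, 1, 3).reshape(B, L, C)   # merge heads
    out = np.swapaxes(out, 0, 1) if axis == 0 else out        # back to (T, F, C)
    return out, attn


rng = np.random.default_rng(0)
feats = rng.standard_normal((6, 5, 4))                   # T=6, F=5, C=4 (toy sizes)
time_out, time_map = axis_self_attention(feats, axis=0)  # time_map: (F, H, T, T)
freq_out, freq_map = axis_self_attention(feats, axis=1)  # freq_map: (T, H, F, F)
```

How the two outputs are combined is not stated in the visible part of the abstract, so the sketch returns them separately rather than guessing a fusion rule.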
Taxonomy
Topics
Speech and Audio Processing · Speech Recognition and Synthesis · Gait Recognition and Analysis
Methods
Attention Is All You Need · Linear Layer · Softmax · Dense Connections · Position-Wise Feed-Forward Layer · Max Pooling · Adam · Byte Pair Encoding · Residual Connection
