U-Former: Improving Monaural Speech Enhancement with Multi-head Self and Cross Attention

Xinmeng Xu; Jianjun Hao

arXiv:2205.08681·eess.AS·October 13, 2022·ICPR

TL;DR

This paper introduces U-Former, a Transformer-based U-net architecture for monaural speech enhancement that effectively models long-term temporal and spectral dependencies using multi-head attention mechanisms, leading to improved speech quality.

Contribution

It proposes a novel U-Former model that integrates multi-head self- and cross-attention in a U-net structure for enhanced long-term context modeling in speech enhancement.

Findings

01. U-Former outperforms recent models in PESQ, STOI, and SSNR scores.

02. The multi-head attention mechanisms effectively capture long-term dependencies.

03. The model improves speech quality and intelligibility in noisy environments.

Abstract

For supervised speech enhancement, contextual information is important for accurate spectral mapping. However, commonly used deep neural networks (DNNs) are limited in capturing temporal contexts. To leverage long-term contexts for tracking a target speaker, this paper treats speech enhancement as a sequence-to-sequence mapping and proposes a novel Transformer-based U-net structure for monaural speech enhancement, dubbed U-Former. The key idea is to model long-term correlations and dependencies, which are crucial for accurate noisy speech modeling, through multi-head attention mechanisms. For this purpose, U-Former incorporates multi-head attention mechanisms at two levels: 1) a multi-head self-attention module which calculates attention maps along both the time and frequency axes to generate time and frequency sub-attention maps for leveraging global interactions between encoder…
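The self-attention idea in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the (channels, time, frequency) layout, the random projection weights, and the choice to return the two branches separately are all assumptions made for illustration. The cross-attention level and any output projection are left out.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(seq, w_q, w_k, w_v, num_heads):
    """Scaled dot-product self-attention over seq of shape (batch, length, dim)."""
    b, n, d = seq.shape
    d_head = d // num_heads

    def split(x):
        # (b, n, d) -> (b, heads, n, d_head)
        return x.reshape(b, n, num_heads, d_head).transpose(0, 2, 1, 3)

    q, k, v = split(seq @ w_q), split(seq @ w_k), split(seq @ w_v)
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(d_head)  # (b, heads, n, n)
    attn = softmax(scores, axis=-1)
    out = (attn @ v).transpose(0, 2, 1, 3).reshape(b, n, d)
    return out, attn

def time_frequency_attention(x, params_t, params_f, num_heads=2):
    """x has the assumed layout (channels, time, freq)."""
    # Time axis: each frequency bin is a batch item, tokens are the T frames.
    seq_t = x.transpose(2, 1, 0)                      # (F, T, C)
    out_t, attn_t = multi_head_self_attention(seq_t, *params_t, num_heads)
    out_t = out_t.transpose(2, 1, 0)                  # back to (C, T, F)
    # Frequency axis: each frame is a batch item, tokens are the F bins.
    seq_f = x.transpose(1, 2, 0)                      # (T, F, C)
    out_f, attn_f = multi_head_self_attention(seq_f, *params_f, num_heads)
    out_f = out_f.transpose(2, 0, 1)                  # back to (C, T, F)
    return out_t, attn_t, out_f, attn_f

# Toy input: 4 channels, 6 frames, 5 frequency bins (hypothetical sizes).
rng = np.random.default_rng(0)
C, T, F = 4, 6, 5
x = rng.standard_normal((C, T, F))
# Random stand-ins for learned query/key/value projections.
params_t = [rng.standard_normal((C, C)) * 0.1 for _ in range(3)]
params_f = [rng.standard_normal((C, C)) * 0.1 for _ in range(3)]

out_t, attn_t, out_f, attn_f = time_frequency_attention(x, params_t, params_f)
print(out_t.shape, attn_t.shape)  # (4, 6, 5) (5, 2, 6, 6)
print(out_f.shape, attn_f.shape)  # (4, 6, 5) (6, 2, 5, 5)
```

Each time sub-attention map is T × T and each frequency sub-attention map is F × F, so every frame can attend to every other frame and every bin to every other bin. How the two branches are fused is not stated in the truncated abstract, so the sketch leaves them separate.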

Code & Models

Repositories

xinmengxu/uformer (PyTorch, official)

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Gait Recognition and Analysis

Methods: Attention Is All You Need · Linear Layer · Softmax · Dense Connections · Position-Wise Feed-Forward Layer · Max Pooling · Adam · Byte Pair Encoding · Residual Connection