MossFormer: Pushing the Performance Limit of Monaural Speech Separation using Gated Single-Head Transformer with Convolution-Augmented Joint Self-Attentions

Shengkui Zhao; Bin Ma

arXiv:2302.11824 · cs.SD · February 24, 2023 · 1 cites

TL;DR

MossFormer introduces a novel gated single-head transformer with convolution-augmented joint self-attentions, significantly improving monaural speech separation performance and approaching the theoretical upper bounds.

Contribution

The paper proposes MossFormer, a new architecture combining joint local and global self-attention with gating and convolution, achieving state-of-the-art results in speech separation.

Findings

01 Achieves state-of-the-art results on WSJ0-2/3mix and WHAM! benchmarks.

02 Reaches the SI-SDRi upper bound of 21.2 dB on WSJ0-3mix.

03 Performs within 0.3 dB of the theoretical upper bound on WSJ0-2mix.
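
For readers unfamiliar with the metric behind these numbers, the following is a minimal sketch of how SI-SDR and its improvement (SI-SDRi) are commonly computed for a single source. It uses the standard textbook definition, not code from the paper; the function names, the `eps` stabiliser, and the synthetic signals are illustrative assumptions.

```python
import numpy as np

def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB for 1-D signals.

    eps is an assumed stabiliser to avoid division by zero.
    """
    estimate = np.asarray(estimate, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)
    # Remove the mean so a DC offset does not affect the score.
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to isolate the target component.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    # Everything not explained by the scaled reference counts as distortion.
    distortion = estimate - target
    ratio = (np.dot(target, target) + eps) / (np.dot(distortion, distortion) + eps)
    return 10.0 * np.log10(ratio)

def si_sdr_improvement(estimate, mixture, reference):
    """SI-SDRi: gain of the separated estimate over the unprocessed mixture."""
    return si_sdr(estimate, reference) - si_sdr(mixture, reference)

if __name__ == "__main__":
    # Synthetic stand-ins for two speakers (hypothetical data, not WSJ0).
    rng = np.random.default_rng(0)
    speaker_1 = rng.standard_normal(8000)
    speaker_2 = rng.standard_normal(8000)
    mixture = speaker_1 + speaker_2
    # A pretend separator output that still leaks 10% of the other speaker.
    estimate = speaker_1 + 0.1 * speaker_2
    print(f"SI-SDRi: {si_sdr_improvement(estimate, mixture, speaker_1):.2f} dB")
```

Benchmark figures such as those above are typically averages of this quantity over all sources and test utterances, with the source-to-output assignment resolved by a permutation search that this sketch omits.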

Abstract

Transformer-based models have provided significant performance improvements in monaural speech separation. However, there is still a performance gap compared to a recently proposed upper bound. The major limitation of the current dual-path Transformer models is the inefficient modelling of long-range elemental interactions and local feature patterns. In this work, we achieve the upper bound by proposing a gated single-head transformer architecture with convolution-augmented joint self-attentions, named MossFormer (Monaural speech separation TransFormer). To address the indirect elemental interactions across chunks in the dual-path architecture, MossFormer employs a joint local and global self-attention architecture that simultaneously performs a full-computation self-attention on local chunks and a linearised low-cost self-attention…
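
The joint local and global attention idea described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: softmax attention for the local branch, the `elu(x) + 1` feature map for the linearised branch, the simple sum of the two branches, and all function names are assumptions chosen for clarity, and the gating and convolution components of MossFormer are left out.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def local_attention(q, k, v, chunk_size):
    """Full (quadratic) attention computed independently inside each chunk."""
    n, d = q.shape
    assert n % chunk_size == 0, "sequence length must be a multiple of chunk_size"
    c = n // chunk_size
    qc = q.reshape(c, chunk_size, d)
    kc = k.reshape(c, chunk_size, d)
    vc = v.reshape(c, chunk_size, -1)
    scores = qc @ kc.transpose(0, 2, 1) / np.sqrt(d)  # (chunks, P, P)
    return (softmax(scores) @ vc).reshape(n, -1)

def feature_map(x):
    # elu(x) + 1: a positive feature map (assumed here, not from the paper).
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def global_linear_attention(q, k, v):
    """Linearised attention over the whole sequence, linear in its length."""
    qf, kf = feature_map(q), feature_map(k)
    kv = kf.T @ v                  # (d, dv): one summary of all positions
    norm = qf @ kf.sum(axis=0)     # (n,): per-query normaliser
    return (qf @ kv) / norm[:, None]

def joint_attention(q, k, v, chunk_size):
    """Sum of the local and global branches (combination rule is an assumption)."""
    return local_attention(q, k, v, chunk_size) + global_linear_attention(q, k, v)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = rng.standard_normal((16, 8))
    k = rng.standard_normal((16, 8))
    v = rng.standard_normal((16, 8))
    print(joint_attention(q, k, v, chunk_size=4).shape)
```

The point of the sketch is the cost split: the local branch pays the quadratic cost only within each chunk, while the global branch lets every position interact with every other directly through the shared `kv` summary, rather than indirectly through alternating intra-chunk and inter-chunk passes as in a dual-path design.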

Code & Models

Models

🤗 samson-castalk/ClearerVoice-Studio (model)

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Ultrasonics and Acoustic Wave Propagation

Methods: Attention Is All You Need · Linear Layer · Adam · Multi-Head Attention · Residual Connection · Layer Normalization · Softmax · Label Smoothing · Position-Wise Feed-Forward Layer · Dense Connections