MossFormer: Pushing the Performance Limit of Monaural Speech Separation using Gated Single-Head Transformer with Convolution-Augmented Joint Self-Attentions
Shengkui Zhao, Bin Ma

TL;DR
MossFormer introduces a novel gated single-head transformer with convolution-augmented joint self-attentions, significantly improving monaural speech separation performance and approaching the theoretical upper bounds.
Contribution
The paper proposes MossFormer, a new architecture combining joint local and global self-attention with gating and convolution, achieving state-of-the-art results in speech separation.
Findings
Achieves state-of-the-art results on WSJ0-2/3mix and WHAM! benchmarks.
Reaches the SI-SDRi upper bound of 21.2 dB on WSJ0-3mix.
Falls only 0.3 dB short of the 23.1 dB SI-SDRi upper bound on WSJ0-2mix.
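The findings above are reported in SI-SDRi, the improvement in scale-invariant signal-to-distortion ratio of the separated signal over the unprocessed mixture. The following is a minimal NumPy sketch of that metric using the commonly used definition, not the paper's evaluation code; the function names `si_sdr` and `si_sdr_improvement` and the `eps` stabiliser are assumptions made for illustration.

```python
import numpy as np


def si_sdr(estimate, target, eps=1e-8):
    """Scale-invariant SDR in dB between two 1-D signals of equal length."""
    # Remove the mean so a DC offset does not affect the score.
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Project the estimate onto the target: the part of the estimate
    # that is explained by a rescaled copy of the target.
    scale = np.dot(estimate, target) / (np.dot(target, target) + eps)
    projection = scale * target
    # Everything else counts as distortion.
    noise = estimate - projection
    ratio = (np.sum(projection ** 2) + eps) / (np.sum(noise ** 2) + eps)
    return 10.0 * np.log10(ratio)


def si_sdr_improvement(estimate, target, mixture):
    """SI-SDRi: SI-SDR of the estimate minus SI-SDR of the raw mixture."""
    return si_sdr(estimate, target) - si_sdr(mixture, target)


# Toy usage with synthetic signals (hypothetical data, not a benchmark).
rng = np.random.default_rng(0)
target = rng.standard_normal(8000)
interferer = rng.standard_normal(8000)
mixture = target + interferer
estimate = target + 0.1 * interferer  # a separator that removes most interference
improvement = si_sdr_improvement(estimate, target, mixture)
```

Because the estimate is projected onto the target before the ratio is taken, rescaling the estimate leaves the score unchanged, which is why the metric is called scale-invariant.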
Abstract
Transformer-based models have provided significant performance improvements in monaural speech separation. However, there is still a performance gap compared to a recently proposed upper bound. The major limitation of the current dual-path Transformer models is the inefficient modelling of long-range elemental interactions and local feature patterns. In this work, we achieve the upper bound by proposing a gated single-head transformer architecture with convolution-augmented joint self-attentions, named MossFormer (Monaural speech separation TransFormer). To effectively solve the indirect elemental interactions across chunks in the dual-path architecture, MossFormer employs a joint local and global self-attention architecture that simultaneously performs a full-computation self-attention on local chunks and a linearised low-cost self-attention…
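The joint local and global attention idea described in the abstract can be illustrated with a rough single-head NumPy sketch: full quadratic attention inside each chunk, plus a linearised attention over the whole sequence. This is not the authors' implementation. The softmax scoring, the `elu(x) + 1` feature map, the plain sum of the two branches, and the names `joint_attention` and `chunk_size` are all assumptions chosen to keep the example short; gating and convolution are omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def positive_feature_map(x):
    # elu(x) + 1, an assumed kernel feature map that keeps values positive.
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))


def joint_attention(q, k, v, chunk_size):
    """q, k, v: arrays of shape (seq_len, dim); seq_len divisible by chunk_size."""
    seq_len, dim = q.shape
    assert seq_len % chunk_size == 0, "pad the sequence to a multiple of chunk_size"
    n_chunks = seq_len // chunk_size

    # Local branch: full-computation attention within each chunk.
    # Cost grows as seq_len * chunk_size rather than seq_len ** 2.
    qc = q.reshape(n_chunks, chunk_size, dim)
    kc = k.reshape(n_chunks, chunk_size, dim)
    vc = v.reshape(n_chunks, chunk_size, dim)
    scores = qc @ kc.transpose(0, 2, 1) / np.sqrt(dim)
    local = (softmax(scores) @ vc).reshape(seq_len, dim)

    # Global branch: linearised attention over the full sequence.
    # Keys and values are summarised once in a (dim, dim) matrix,
    # so every position sees every other position directly.
    phi_q = positive_feature_map(q)
    phi_k = positive_feature_map(k)
    kv_summary = phi_k.T @ v            # (dim, dim)
    normaliser = phi_q @ phi_k.sum(0)   # (seq_len,)
    global_ = (phi_q @ kv_summary) / normaliser[:, None]

    # Assumed combination: a plain sum of the two branches.
    return local + global_


rng = np.random.default_rng(0)
q = rng.standard_normal((16, 4))
k = rng.standard_normal((16, 4))
v = rng.standard_normal((16, 4))
out = joint_attention(q, k, v, chunk_size=4)
```

In this sketch the global branch lets distant chunks interact in a single step, while the local branch retains exact attention where fine-grained patterns matter.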
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Ultrasonics and Acoustic Wave Propagation
Methods: Attention Is All You Need · Linear Layer · Adam · Multi-Head Attention · Residual Connection · Layer Normalization · Softmax · Label Smoothing · Position-Wise Feed-Forward Layer · Dense Connections
