Monaural source separation: From anechoic to reverberant environments

Tobias Cord-Landwehr; Christoph Boeddeker; Thilo von Neumann; Catalin Zorila; Rama Doddipatla; Reinhold Haeb-Umbach

arXiv:2111.07578·eess.AS·May 11, 2022

TL;DR

This paper investigates adapting neural network-based monaural speech separation methods from anechoic to reverberant environments, revealing that recent improvements may not translate well to real-world reverberant conditions.

Contribution

It systematically modifies the SepFormer model to handle reverberant data and evaluates its performance, highlighting the gap between anechoic and reverberant source separation.

Findings

1. A 7 percentage point WER improvement over the standard SepFormer implementation.

2. The system optimized for reverberant data performs only marginally better than a simple PIT-BLSTM separation system.

3. Recent improvements reported on anechoic data may not generalize to reverberant environments.

Abstract

Impressive progress in neural network-based single-channel speech source separation has been made in recent years. But those improvements have been mostly reported on anechoic data, a situation that is hardly met in practice. Taking the SepFormer as a starting point, which achieves state-of-the-art performance on anechoic mixtures, we gradually modify it to optimize its performance on reverberant mixtures. Although this leads to a word error rate improvement by 7 percentage points compared to the standard SepFormer implementation, the system ends up with only marginally better performance than a PIT-BLSTM separation system, that is optimized with rather straightforward means. This is surprising and at the same time sobering, challenging the practical usefulness of many improvements reported in recent years for monaural source separation on nonreverberant data.
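The abstract reports its headline result as a word error rate (WER) difference in percentage points, that is, an absolute difference between two WER values rather than a relative reduction. As a rough illustration of the metric itself, here is a minimal sketch that computes WER as the word-level edit distance divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for this example; the paper's actual scoring pipeline is not described on this page.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative helper only (hypothetical name); tokenization is a plain
    whitespace split, which real ASR scoring tools usually refine.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j]: edit distance between the reference words processed so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # aligning i reference words to an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One deleted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because WER is normalized by the reference length, it can exceed 1.0 when the hypothesis contains many inserted words.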

Taxonomy

Topics: Speech and Audio Processing · Advanced Adaptive Filtering Techniques · Speech Recognition and Synthesis

Methods: Multi-Head Attention · Attention Is All You Need · Dense Connections · Residual Connection · Softmax · Position-Wise Feed-Forward Layer · Linear Layer · Layer Normalization · Parameterized ReLU