Filterbank design for end-to-end speech separation

Manuel Pariente; Samuele Cornell; Antoine Deleforge; Emmanuel Vincent

arXiv:1910.10400 · cs.SD · March 2, 2020

TL;DR

This paper extends real-valued learned and parameterized filterbanks into complex-valued analytic filterbanks for end-to-end speech separation. On the noisy WHAM dataset, the analytic learned filterbank consistently outperforms the real-valued filterbank of ConvTasNet.

Contribution

The paper extends real-valued learned and parameterized filterbanks into complex-valued analytic filterbanks, defines a set of corresponding representations and masking strategies, and evaluates them on noisy speech separation.

Findings

1. The analytic learned filterbank consistently outperforms the real-valued filterbank of ConvTasNet.
2. Complex-valued representations and masks are beneficial in all tested conditions.
3. The STFT performs best with 2 ms windows.

Abstract

Single-channel speech separation has recently made great progress thanks to learned filterbanks as used in ConvTasNet. In parallel, parameterized filterbanks have been proposed for speaker recognition where only center frequencies and bandwidths are learned. In this work, we extend real-valued learned and parameterized filterbanks into complex-valued analytic filterbanks and define a set of corresponding representations and masking strategies. We evaluate these filterbanks on a newly released noisy speech separation dataset (WHAM). The results show that the proposed analytic learned filterbank consistently outperforms the real-valued filterbank of ConvTasNet. Also, we validate the use of parameterized filterbanks and show that complex-valued representations and masks are beneficial in all conditions. Finally, we show that the STFT achieves its best performance for 2ms windows.
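
To make the idea of an analytic filterbank concrete, here is a minimal NumPy sketch, not the authors' implementation. It assumes the analytic extension is formed by pairing each real filter with its Hilbert transform (computed here by zeroing negative frequencies), and the names `analytic_extension` and `analyze`, the filter count, the 16-sample filter length (2 ms if the audio is sampled at 8 kHz), and the random mask are all illustrative choices.

```python
import numpy as np


def analytic_extension(real_filters):
    """Return complex filters whose real part is the input and whose
    imaginary part is its (discrete) Hilbert transform.

    real_filters: array of shape (n_filters, filter_len).
    """
    n = real_filters.shape[-1]
    spectrum = np.fft.fft(real_filters, axis=-1)
    # Keep DC (and Nyquist for even n), double positive frequencies,
    # and zero out negative frequencies.
    gain = np.zeros(n)
    gain[0] = 1.0
    if n % 2 == 0:
        gain[n // 2] = 1.0
        gain[1 : n // 2] = 2.0
    else:
        gain[1 : (n + 1) // 2] = 2.0
    return np.fft.ifft(spectrum * gain, axis=-1)


def analyze(signal, filters, stride):
    """Strided correlation of a 1-D signal with every filter.

    Returns an array of shape (n_filters, n_frames).
    """
    filter_len = filters.shape[-1]
    n_frames = 1 + (len(signal) - filter_len) // stride
    frames = np.stack(
        [signal[i * stride : i * stride + filter_len] for i in range(n_frames)]
    )
    return filters @ frames.T


rng = np.random.default_rng(0)

# Stand-in for learned real-valued filters (random here, purely illustrative).
real_filters = rng.standard_normal((8, 16))
filters = analytic_extension(real_filters)

signal = rng.standard_normal(160)
rep = analyze(signal, filters, stride=8)  # complex-valued representation

# A complex mask scales the magnitude and shifts the phase of each coefficient.
mask = rng.uniform(size=rep.shape) * np.exp(
    1j * rng.uniform(-np.pi, np.pi, size=rep.shape)
)
masked = mask * rep
magnitude = np.abs(rep)  # a real-valued representation derived from the complex one
```

In a separation system the mask would be predicted by a network rather than drawn at random, and the filters would be trained or parameterized rather than sampled; the sketch only shows how a real filterbank yields a complex representation to which magnitude or complex masks can be applied.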

Taxonomy

Methods: Convolutional time-domain audio separation network