TL;DR
This paper introduces complex-valued analytic filterbanks for end-to-end speech separation and shows that the analytic learned filterbank consistently outperforms the real-valued ConvTasNet filterbank on the noisy WHAM dataset.
Contribution
The paper extends real-valued learned and parameterized filterbanks into complex-valued analytic filterbanks, defines the corresponding representations and masking strategies, and evaluates them on noisy speech separation.
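To make the idea of an analytic extension concrete, the following is a minimal sketch of one standard way to turn real-valued filters into complex-valued analytic ones: zero the negative-frequency bins and double the positive ones (the discrete Hilbert-transform construction). The function name `make_analytic`, the filter count, and the filter length are illustrative assumptions, and this is not necessarily the exact construction used in the paper.

```python
import numpy as np


def make_analytic(filters):
    """Return complex analytic versions of real filters.

    filters: real array of shape (n_filters, length).
    The real part of the result equals the input; the imaginary part
    is its discrete Hilbert transform.
    """
    n = filters.shape[-1]
    spectrum = np.fft.fft(filters, axis=-1)

    # Keep DC (and Nyquist for even n), double positive frequencies,
    # and zero out negative frequencies.
    gain = np.zeros(n)
    gain[0] = 1.0
    if n % 2 == 0:
        gain[n // 2] = 1.0
        gain[1:n // 2] = 2.0
    else:
        gain[1:(n + 1) // 2] = 2.0

    return np.fft.ifft(spectrum * gain, axis=-1)


# Hypothetical example: 4 random real filters of 16 taps each.
rng = np.random.default_rng(0)
filters = rng.standard_normal((4, 16))
analytic = make_analytic(filters)
```

In a learned setting, only the real filters would typically be trained, and the imaginary parts would be derived from them in this way, so the analytic filterbank adds no free parameters.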
Findings
Analytic learned filterbanks outperform real-valued ConvTasNet filterbanks.
Complex-valued representations and masks improve separation performance.
The STFT achieves its best performance with 2 ms windows.
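The difference between a real-valued and a complex-valued mask, and the size of a 2 ms window, can be illustrated with a short sketch. Everything here is an assumption for illustration: the 8 kHz sample rate is not stated in this summary, the representation is random data, and the masks are generic rather than the specific masking strategies defined in the paper.

```python
import numpy as np

# Assumed sample rate (not given in this summary).
SAMPLE_RATE_HZ = 8000
# A 2 ms window expressed in samples at the assumed rate.
window_samples = int(round(0.002 * SAMPLE_RATE_HZ))

rng = np.random.default_rng(0)

# Hypothetical complex representation: (n_filters, n_frames).
rep = rng.standard_normal((4, 10)) + 1j * rng.standard_normal((4, 10))

# Real-valued mask: rescales each coefficient but keeps its phase.
real_mask = rng.uniform(0.0, 1.0, size=rep.shape)
masked_real = real_mask * rep

# Complex-valued mask: rescales and also rotates each coefficient.
phase_shift = rng.uniform(-np.pi, np.pi, size=rep.shape)
complex_mask = real_mask * np.exp(1j * phase_shift)
masked_complex = complex_mask * rep
```

Both masked outputs have the same magnitudes here, but only the complex mask can change the phase of the mixture representation, which is the extra freedom a complex-valued mask provides.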
Abstract
Single-channel speech separation has recently made great progress thanks to learned filterbanks as used in ConvTasNet. In parallel, parameterized filterbanks have been proposed for speaker recognition where only center frequencies and bandwidths are learned. In this work, we extend real-valued learned and parameterized filterbanks into complex-valued analytic filterbanks and define a set of corresponding representations and masking strategies. We evaluate these filterbanks on a newly released noisy speech separation dataset (WHAM). The results show that the proposed analytic learned filterbank consistently outperforms the real-valued filterbank of ConvTasNet. Also, we validate the use of parameterized filterbanks and show that complex-valued representations and masks are beneficial in all conditions. Finally, we show that the STFT achieves its best performance for 2ms windows.
Taxonomy
Methods: Convolutional time-domain audio separation network
