On End-to-end Multi-channel Time Domain Speech Separation in Reverberant Environments
Jisi Zhang, Catalin Zorila, Rama Doddipatla, Jon Barker

TL;DR
This paper presents a fully-convolutional neural network approach for multi-channel speech separation in reverberant environments, eliminating the need for traditional spatial features and improving separation and recognition performance.
Contribution
The paper introduces a novel end-to-end time domain speech separation method using a fully-convolutional network combined with dereverberation pre-processing, enhancing performance over conventional systems.
Findings
Source separation metric improved by more than 13% relative over a reference system.
Word error rate (WER) reduced by more than 50% relative over the same reference system.
Dereverberation pre-processing further decreased WER by 29%.
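The abstract reports these percentages as relative changes, so it may help to see how such a figure is computed. The sketch below is illustrative only: the function names and the numeric inputs are hypothetical and are not taken from the paper.

```python
def relative_reduction(baseline: float, proposed: float) -> float:
    """Relative reduction, in percent, of a lower-is-better metric such as WER."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (baseline - proposed) / baseline


def relative_gain(baseline: float, proposed: float) -> float:
    """Relative gain, in percent, of a higher-is-better metric such as a separation score."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (proposed - baseline) / baseline


# Hypothetical values chosen only to illustrate the arithmetic.
wer_drop = relative_reduction(baseline=20.0, proposed=10.0)   # WER 20% -> 10%
metric_gain = relative_gain(baseline=10.0, proposed=11.5)     # score 10 -> 11.5
print(wer_drop, metric_gain)
```

A relative reduction of 50% therefore means the error rate was halved, not that it fell by 50 absolute percentage points.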
Abstract
This paper introduces a new method for multi-channel time domain speech separation in reverberant environments. A fully-convolutional neural network structure has been used to directly separate speech from multiple microphone recordings, without the need for conventional spatial feature extraction. To reduce the influence of reverberation on spatial feature extraction, a dereverberation pre-processing method has been applied to further improve the separation performance. A spatialized version of the wsj0-2mix dataset has been simulated to evaluate the proposed system. Both the source separation and the speech recognition performance of the separated signals have been evaluated objectively. Experiments show that the proposed fully-convolutional network improves the source separation metric and the word error rate (WER) by more than 13% and 50% relative, respectively, over a reference system with…
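To make the idea of separating directly from multi-microphone waveforms more concrete, here is a minimal sketch of a single 1-D convolution whose filters span all microphone channels, so any cross-channel cues are captured by the filter weights rather than by hand-crafted spatial features. This is not the paper's architecture: the function name, the layer sizes, the ReLU non-linearity, and the random weights are all assumptions made for illustration.

```python
import random


def multichannel_conv1d(mixture, filters, stride):
    """Apply 1-D filters that span all microphone channels, followed by a ReLU.

    mixture: list of C channels, each a list of T waveform samples.
    filters: list of N filters, each a list of C kernels of length L.
    Returns N feature rows, each with (T - L) // stride + 1 frames.
    """
    num_samples = len(mixture[0])
    kernel_len = len(filters[0][0])
    num_frames = (num_samples - kernel_len) // stride + 1
    features = []
    for filt in filters:
        row = []
        for frame in range(num_frames):
            start = frame * stride
            acc = 0.0
            # Sum over channels and kernel taps: one filter sees every microphone.
            for channel, kernel in zip(mixture, filt):
                for k in range(kernel_len):
                    acc += channel[start + k] * kernel[k]
            row.append(max(acc, 0.0))  # ReLU (assumed non-linearity)
        features.append(row)
    return features


# Hypothetical sizes: 2 microphones, 16 samples, 3 filters of length 4, hop of 2.
random.seed(0)
mixture = [[random.uniform(-1, 1) for _ in range(16)] for _ in range(2)]
filters = [[[random.uniform(-1, 1) for _ in range(4)] for _ in range(2)]
           for _ in range(3)]
features = multichannel_conv1d(mixture, filters, stride=2)
print(len(features), len(features[0]))
```

In a trained system the filter weights would be learned jointly with the separation network; here they are random and serve only to show the shapes involved.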
