End-to-End Speech Separation with Unfolded Iterative Phase Reconstruction
Zhong-Qiu Wang, Jonathan Le Roux, DeLiang Wang, John R. Hershey

TL;DR
This paper introduces an end-to-end deep learning method for speech separation that defines its loss directly on the reconstructed signals and trains through unfolded iterations of a phase reconstruction algorithm, reaching 12.6 dB SI-SDR on the wsj0-2mix dataset.
Contribution
It presents a novel end-to-end framework with unfolded phase reconstruction layers and new activation functions for mask estimation, improving speech separation performance.
Findings
Achieved 12.6 dB SI-SDR on the wsj0-2mix dataset.
Incorporated phase reconstruction into training by unfolding its iterations as a series of STFT and inverse STFT layers.
Demonstrated significant progress towards solving the cocktail party problem.
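For readers unfamiliar with the metric quoted above, the following is a minimal NumPy sketch of scale-invariant signal-to-distortion ratio (SI-SDR). The function name `si_sdr`, the `eps` stabilizer, and the mean removal step are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def si_sdr(estimate, target, eps=1e-8):
    """Scale-invariant SDR in dB between two 1-D signals (illustrative sketch)."""
    # Assumption: remove the mean of both signals before comparing them.
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Project the estimate onto the target to find the optimally scaled target.
    alpha = np.dot(estimate, target) / (np.dot(target, target) + eps)
    scaled_target = alpha * target
    # Everything in the estimate that is not explained by the scaled target.
    distortion = estimate - scaled_target
    ratio = (np.dot(scaled_target, scaled_target) + eps) / (np.dot(distortion, distortion) + eps)
    return 10.0 * np.log10(ratio)
```

Because the target is rescaled to best match the estimate, multiplying the estimate by a constant leaves the score essentially unchanged, which is what makes the metric scale-invariant.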
Abstract
This paper proposes an end-to-end approach for single-channel speaker-independent multi-speaker speech separation, where time-frequency (T-F) masking, the short-time Fourier transform (STFT), and its inverse are represented as layers within a deep network. Previous approaches, rather than computing a loss on the reconstructed signal, used a surrogate loss based on the target STFT magnitudes. This ignores reconstruction error introduced by phase inconsistency. In our approach, the loss function is directly defined on the reconstructed signals, which are optimized for best separation. In addition, we train through unfolded iterations of a phase reconstruction algorithm, represented as a series of STFT and inverse STFT layers. While mask values are typically limited to lie between zero and one for approaches using the mixture phase for reconstruction, this limitation is less relevant if…
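The pipeline described in the abstract (T-F masking, then iterations of phase reconstruction built from STFT and inverse STFT steps) can be sketched in NumPy as below. This is only an illustration under stated assumptions: the phase reconstruction follows the multiple input spectrogram inversion (MISI) scheme, which the excerpt above does not name; the helper names `stft`, `istft`, and `misi`, the Hann window, and the frame sizes are hypothetical choices; and oracle magnitude ratios stand in for network-estimated masks. In a trained system these steps would be differentiable layers in an autodiff framework, with the loss computed on the returned waveforms.

```python
import numpy as np

def stft(x, n_fft=256, hop=64):
    """Framed, Hann-windowed FFT; returns an array of shape (frames, bins)."""
    win = np.hanning(n_fft + 1)[:-1]  # periodic Hann window
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * win for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def istft(X, n_fft=256, hop=64):
    """Weighted overlap-add inverse of `stft` above."""
    win = np.hanning(n_fft + 1)[:-1]
    length = n_fft + hop * (X.shape[0] - 1)
    x = np.zeros(length)
    norm = np.zeros(length)
    frames = np.fft.irfft(X, n=n_fft, axis=1)
    for i in range(X.shape[0]):
        x[i * hop:i * hop + n_fft] += frames[i] * win
        norm[i * hop:i * hop + n_fft] += win ** 2
    return x / np.maximum(norm, 1e-8)

def misi(mixture, magnitudes, n_iter=3, n_fft=256, hop=64):
    """MISI-style phase reconstruction, unrolled for a fixed number of iterations."""
    mix_phase = np.angle(stft(mixture, n_fft, hop))
    phases = [mix_phase for _ in magnitudes]  # start from the mixture phase
    for _ in range(n_iter):
        # Resynthesize each source from its magnitude and current phase estimate.
        signals = [istft(m * np.exp(1j * p), n_fft, hop)
                   for m, p in zip(magnitudes, phases)]
        # Spread the mixture reconstruction error equally across the sources.
        residual = (mixture[:len(signals[0])] - sum(signals)) / len(signals)
        # Keep the estimated magnitudes; take only the updated phase.
        phases = [np.angle(stft(s + residual, n_fft, hop)) for s in signals]
    return [istft(m * np.exp(1j * p), n_fft, hop)
            for m, p in zip(magnitudes, phases)]

# Toy usage: two noise "speakers"; length chosen so frames tile the signal exactly.
rng = np.random.default_rng(0)
s1, s2 = rng.standard_normal((2, 256 + 64 * 100))
mixture = s1 + s2
mix_mag = np.abs(stft(mixture))
# Oracle magnitude ratios stand in for estimated masks; they are not clipped to [0, 1].
masks = [np.abs(stft(s)) / (mix_mag + 1e-8) for s in (s1, s2)]
estimates = misi(mixture, [m * mix_mag for m in masks], n_iter=3)
```

Setting `n_iter=0` reduces the sketch to plain masking with the mixture phase, which makes it easy to compare against the unrolled variant on the same inputs.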
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing
