End-to-End Speech Separation with Unfolded Iterative Phase Reconstruction

Zhong-Qiu Wang; Jonathan Le Roux; DeLiang Wang; John R. Hershey

arXiv:1804.10204 · cs.SD · April 30, 2018 · 31 cites

Open Access

TL;DR

This paper introduces an end-to-end deep learning method for speech separation that defines its loss directly on the reconstructed signals and trains through phase reconstruction, achieving a state-of-the-art 12.6 dB SI-SDR on the wsj0-2mix dataset.

Contribution

It presents a novel end-to-end framework with unfolded phase reconstruction layers and new activation functions for mask estimation, improving speech separation performance.

Findings

01

Achieved 12.6 dB SI-SDR on the wsj0-2mix dataset.

02

Made phase reconstruction part of the training process by training through unfolded iterations of a phase reconstruction algorithm.

03

Demonstrated significant progress towards solving the cocktail party problem.
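For readers unfamiliar with the metric in finding 01, the following is a minimal sketch of scale-invariant signal-to-distortion ratio (SI-SDR), assuming its commonly used definition: the estimate is projected onto the reference signal, and the energy of that projection is compared with the energy of what remains. The function name `si_sdr` and the toy signals are illustrative and do not come from the paper.

```python
import numpy as np

def si_sdr(estimate, target):
    """Scale-invariant SDR in dB between two 1-D signals (assumed definition)."""
    estimate = np.asarray(estimate, dtype=float)
    target = np.asarray(target, dtype=float)
    # Scaling of the target that best explains the estimate (least squares).
    alpha = np.dot(estimate, target) / np.dot(target, target)
    projection = alpha * target
    residual = estimate - projection
    return 10.0 * np.log10(np.sum(projection ** 2) / np.sum(residual ** 2))

# Toy signals: a reference and a slightly distorted estimate of it.
target = np.array([1.0, 0.0, -1.0, 0.0])
estimate = target + np.array([0.0, 0.1, 0.0, -0.1])

# Rescaling the estimate should leave the score unchanged.
print(si_sdr(estimate, target), si_sdr(3.0 * estimate, target))
```

Because the reference is rescaled before the comparison, an estimate that is merely louder or quieter than the reference is not penalized.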

Abstract

This paper proposes an end-to-end approach for single-channel speaker-independent multi-speaker speech separation, where time-frequency (T-F) masking, the short-time Fourier transform (STFT), and its inverse are represented as layers within a deep network. Previous approaches, rather than computing a loss on the reconstructed signal, used a surrogate loss based on the target STFT magnitudes. This ignores reconstruction error introduced by phase inconsistency. In our approach, the loss function is directly defined on the reconstructed signals, which are optimized for best separation. In addition, we train through unfolded iterations of a phase reconstruction algorithm, represented as a series of STFT and inverse STFT layers. While mask values are typically limited to lie between zero and one for approaches using the mixture phase for reconstruction, this limitation is less relevant if…
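The pipeline the abstract describes (estimated magnitudes, an inverse STFT, and repeated STFT and inverse STFT rounds that refine the phase) can be sketched as a forward pass. This is a minimal NumPy sketch, not the authors' implementation: it assumes a MISI-style update for the phase reconstruction iterations, oracle magnitudes standing in for the network's masked output, and an L1 waveform loss. `N_FFT`, `HOP`, `stft`, `istft`, `unfolded_phase_reconstruction`, and the toy signals are all hypothetical. In a real end-to-end system these steps would be written in an automatic-differentiation framework so that gradients flow through every STFT and inverse STFT layer.

```python
import numpy as np

N_FFT, HOP = 64, 16                  # hypothetical frame length and hop size
WIN = np.hanning(N_FFT + 1)[:-1]     # periodic Hann window

def stft(x):
    """Zero-pad by N_FFT on both sides, then window and FFT each frame."""
    xp = np.pad(x, N_FFT)
    n_frames = 1 + (len(xp) - N_FFT) // HOP
    frames = np.stack([xp[i * HOP:i * HOP + N_FFT] for i in range(n_frames)])
    return np.fft.rfft(frames * WIN, axis=-1)

def istft(spec, length):
    """Windowed overlap-add, normalized by the summed squared window."""
    frames = np.fft.irfft(spec, n=N_FFT, axis=-1) * WIN
    out = np.zeros((spec.shape[0] - 1) * HOP + N_FFT)
    norm = np.zeros_like(out)
    for i, frame in enumerate(frames):
        out[i * HOP:i * HOP + N_FFT] += frame
        norm[i * HOP:i * HOP + N_FFT] += WIN ** 2
    out /= np.maximum(norm, 1e-8)
    return out[N_FFT:N_FFT + length]  # drop the padding added in stft

def unfolded_phase_reconstruction(mixture, magnitudes, n_iter):
    """Assumed MISI-style update, unrolled for a fixed number of iterations."""
    n_src, length = len(magnitudes), len(mixture)
    mix_phase = np.angle(stft(mixture))          # start from the mixture phase
    sources = [istft(m * np.exp(1j * mix_phase), length) for m in magnitudes]
    for _ in range(n_iter):
        # Spread the mixture reconstruction error evenly over the sources.
        residual = (mixture - sum(sources)) / n_src
        phases = [np.angle(stft(s + residual)) for s in sources]
        # Keep the estimated magnitudes, swap in the refined phases.
        sources = [istft(m * np.exp(1j * p), length)
                   for m, p in zip(magnitudes, phases)]
    return np.stack(sources)

# Toy example: a sinusoid plus noise, with oracle magnitudes as a stand-in
# for what a mask-estimation network would produce.
rng = np.random.default_rng(0)
t = np.arange(800)
s1 = np.sin(2 * np.pi * 0.03 * t)
s2 = 0.5 * rng.standard_normal(800)
mixture = s1 + s2
magnitudes = [np.abs(stft(s1)), np.abs(stft(s2))]

for k in (0, 3):
    est = unfolded_phase_reconstruction(mixture, magnitudes, n_iter=k)
    # Loss computed on the reconstructed waveforms, not on STFT magnitudes.
    loss = np.abs(est - np.stack([s1, s2])).sum()
    print(f"iterations={k}  L1 waveform loss={loss:.3f}")
```

With `n_iter=0` the sketch reduces to reconstruction with the mixture phase. Each extra iteration adds one STFT and one inverse STFT per source, which is what makes the procedure expressible as a fixed stack of layers.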

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing