Speaker-independent Speech Separation with Deep Attractor Network
Yi Luo, Zhuo Chen, Nima Mesgarani

TL;DR
This paper introduces a deep learning framework that separates the speech of multiple speakers recorded with a single microphone, addressing both the permutation problem and the unknown number of speakers with a novel attractor-based clustering approach.
Contribution
The paper presents a new deep attractor network that clusters time-frequency embeddings for speaker-independent separation, handling permutation and speaker number issues simultaneously.
Findings
Achieves separation performance comparable to or better than state-of-the-art methods.
Handles arbitrary speaker permutations and an unknown number of speakers.
Demonstrates robustness on the WSJ0 dataset with two- and three-speaker mixtures.
Abstract
Despite the recent success of deep learning for many speech processing tasks, single-microphone, speaker-independent speech separation remains challenging for two main reasons. The first reason is the arbitrary order of the target and masker speakers in the mixture (permutation problem), and the second is the unknown number of speakers in the mixture (output dimension problem). We propose a novel deep learning framework for speech separation that addresses both of these issues. We use a neural network to project the time-frequency representation of the mixture signal into a high-dimensional embedding space. A reference point (attractor) is created in the embedding space to represent each speaker, which is defined as the centroid of the speaker in the embedding space. The time-frequency embeddings of each speaker are then forced to cluster around the corresponding attractor point which is used…
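The centroid definition in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function names `attractors` and `soft_masks`, the toy array shapes, and in particular the use of a softmax over dot-product similarities to turn attractors into masks are assumptions made for illustration. The abstract is truncated before it says how the attractors are used.

```python
import numpy as np

def attractors(embeddings, assignments):
    """Centroid of each speaker's time-frequency embeddings.

    embeddings:  (N, K) array, one K-dimensional vector per time-frequency bin.
    assignments: (N, C) array, one-hot speaker membership per bin.
    Returns a (C, K) array with one attractor per speaker.
    """
    counts = assignments.sum(axis=0)        # bins per speaker, shape (C,)
    counts = np.maximum(counts, 1e-8)       # guard against an empty speaker
    return (assignments.T @ embeddings) / counts[:, None]

def soft_masks(embeddings, attractor_points):
    """Assumed mask rule: softmax over similarity to each attractor."""
    sim = embeddings @ attractor_points.T   # (N, C) dot-product similarity
    sim = sim - sim.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(sim)
    return exp / exp.sum(axis=1, keepdims=True)

# Toy example: 4 time-frequency bins, 2-dimensional embeddings, 2 speakers.
V = np.array([[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]])
Y = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)

A = attractors(V, Y)   # one centroid per speaker
M = soft_masks(V, A)   # per-bin soft assignment to each speaker
```

In this toy case the first two bins belong to one speaker and the last two to the other, so each attractor is the mean of its two embeddings and each row of `M` sums to one. Because the attractors are computed from the embeddings rather than tied to fixed output slots, the same code runs for any number of speakers `C`.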
