Improved Source Counting and Separation for Monaural Mixture
Yiming Xiao, Haijian Zhang

TL;DR
This paper introduces a single-channel speech separation model that jointly estimates the number of speakers and separates their voices, reporting a 96.7% probability of correct speaker-count estimation and state-of-the-art SI-SNRi and SDRi on the GRID dataset.
Contribution
A new model integrating time-frequency features and speaker counting via Gerschgorin disks, enabling accurate speaker number estimation and separation in monaural mixtures.
Findings
96.7% probability of correctly estimating speaker count
State-of-the-art SI-SNRi and SDRi performance on GRID dataset
Effective joint learning of speaker counting and separation
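The findings are reported as SI-SNRi, the improvement in scale-invariant signal-to-noise ratio of the separated signal over the unprocessed mixture. The following is a minimal NumPy sketch of the standard definition of that metric, not the paper's evaluation code. The function names, the `eps` stabiliser, and the synthetic signals are assumptions made for illustration.

```python
import numpy as np

def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between two 1-D signals of equal length."""
    # Remove the mean so a DC offset does not affect the score.
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Project the estimate onto the target to isolate the target component.
    scale = np.dot(estimate, target) / (np.dot(target, target) + eps)
    projection = scale * target
    # Everything left over is treated as noise.
    noise = estimate - projection
    ratio = (np.dot(projection, projection) + eps) / (np.dot(noise, noise) + eps)
    return 10.0 * np.log10(ratio)

def si_snr_improvement(estimate, mixture, target):
    """SI-SNRi: gain of the separated estimate over the raw mixture."""
    return si_snr(estimate, target) - si_snr(mixture, target)

# Synthetic stand-ins for speech (assumption: white noise, not real audio).
rng = np.random.default_rng(0)
target = rng.standard_normal(8000)
interferer = rng.standard_normal(8000)
mixture = target + interferer            # unprocessed two-source mixture
estimate = target + 0.1 * interferer     # imperfect separation output
print(f"SI-SNRi: {si_snr_improvement(estimate, mixture, target):.1f} dB")
```

Because the estimate is projected onto the target before the ratio is taken, rescaling the estimate leaves the score essentially unchanged, which is the property the "scale-invariant" name refers to.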
Abstract
Single-channel speech separation in the time domain and the frequency domain has been widely studied for voice-driven applications over the past few years. However, most previous work assumes that the number of speakers is known in advance, which is not easily obtained from a monaural mixture in practice. In this paper, we propose a novel model for single-channel multi-speaker separation that jointly learns the time-frequency features and the unknown number of speakers. Specifically, our model integrates the time-domain convolution-encoded feature map and the frequency-domain spectrogram through an attention mechanism, and the integrated features are projected into high-dimensional embedding vectors, which are then clustered with a deep attractor network to modify the encoded feature. Meanwhile, the number of speakers is counted by computing the Gerschgorin disks of the embedding vectors which are orthogonal for…
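The abstract is cut off before it explains how the Gerschgorin disks are computed, so the paper's exact procedure is not reproduced here. As a rough illustration of the general idea, the sketch below applies a commonly cited form of the Gerschgorin disk estimator from array processing to synthetic embeddings: the second-moment matrix is partitioned, its leading block is diagonalised, and the radii of the transformed disks are compared against a scaled average radius. The function name, the `adjust` factor of 0.5, the embedding dimension, the noise level, and the speaker directions are all assumptions chosen for the example. In particular, the synthetic directions share a component along the last axis, which this estimator needs in order to see every source.

```python
import numpy as np

def gerschgorin_count(embeddings, adjust=0.5):
    """Estimate the number of dominant directions in an (N, D) embedding matrix."""
    n, d = embeddings.shape
    # Uncentered second-moment matrix of the embeddings, shape (D, D).
    cov = embeddings.T @ embeddings / n
    # Partition: leading (D-1, D-1) block and the last column above the corner.
    sub = cov[:-1, :-1]
    last_col = cov[:-1, -1]
    # Eigenvectors of the leading block, sorted by descending eigenvalue.
    eigvals, eigvecs = np.linalg.eigh(sub)
    eigvecs = eigvecs[:, np.argsort(eigvals)[::-1]]
    # Radii of the transformed Gerschgorin disks: large for signal
    # directions, close to zero for noise directions.
    radii = np.abs(eigvecs.T @ last_col)
    # Heuristic threshold: adjust / (D - 1) * sum(radii).
    threshold = adjust * radii.mean()
    below = np.nonzero(radii < threshold)[0]
    # The position of the first sub-threshold radius is the estimated count.
    return int(below[0]) if below.size else d - 1

# Synthetic embeddings (assumption): each time-frequency bin belongs to one of
# three hypothetical speakers and sits near that speaker's direction.
rng = np.random.default_rng(0)
dim, c = 8, 1.0 / np.sqrt(2.0)
directions = np.zeros((3, dim))
for k in range(3):
    directions[k, k] = np.sqrt(1.0 - c**2)  # speaker-specific component
    directions[k, -1] = c                   # shared component on the last axis
labels = np.repeat([0, 1, 2], [3000, 2000, 1000])
embeddings = directions[labels] + 0.05 * rng.standard_normal((labels.size, dim))
print(gerschgorin_count(embeddings))
```

The threshold factor trades missed speakers against false detections, so in practice it would have to be tuned to the embedding space; the value used here only suits this synthetic setup.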
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis
