Improved Source Counting and Separation for Monaural Mixture

Yiming Xiao; Haijian Zhang

arXiv:2004.00175 · eess.AS · April 2, 2020 · 6 citations

Open Access

TL;DR

This paper introduces a single-channel speech separation model that jointly estimates the number of speakers and separates their voices. It estimates the speaker count correctly with 96.7% probability and reports state-of-the-art SI-SNRi and SDRi on the GRID dataset.

Contribution

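A new model that fuses the time-domain convolution-encoded feature map with the frequency-domain spectrogram through an attention mechanism and counts speakers from the Gerschgorin disks of the resulting embedding vectors, enabling both speaker number estimation and separation in monaural mixtures.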

Findings

01 · 96.7% probability of correctly estimating the speaker count

02 · State-of-the-art SI-SNRi and SDRi performance on the GRID dataset

03 · Effective joint learning of speaker counting and separation
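SI-SNRi, named in the findings above, is the improvement in scale-invariant signal-to-noise ratio of the separated signal over the unprocessed mixture. The following is a minimal sketch of the commonly used definition, assuming NumPy and single-channel signals of equal length. The function names, the `eps` stabilizer, and the toy sine signals are illustrative assumptions and are not taken from the paper's evaluation code.

```python
import numpy as np


def si_snr(estimate, reference, eps=1e-8):
    """Scale-invariant SNR in dB between an estimate and a clean reference."""
    estimate = np.asarray(estimate, dtype=float)
    reference = np.asarray(reference, dtype=float)
    # Remove the mean so a constant offset does not affect the score.
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to get the target component.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    noise = estimate - target
    return 10.0 * np.log10((np.dot(target, target) + eps) / (np.dot(noise, noise) + eps))


def si_snr_improvement(estimate, mixture, reference):
    """SI-SNRi: gain of the estimate over the raw mixture, in dB."""
    return si_snr(estimate, reference) - si_snr(mixture, reference)


# Toy signals (assumption): two sinusoids standing in for two speakers.
t = np.linspace(0.0, 1.0, 800, endpoint=False)
reference = np.sin(2 * np.pi * 5 * t)
interferer = np.sin(2 * np.pi * 11 * t)
mixture = reference + interferer
estimate = reference + 0.1 * interferer  # interferer attenuated by a factor of 10

print(si_snr_improvement(estimate, mixture, reference))
```

Because the estimate is projected onto the reference before the ratio is taken, rescaling the estimate leaves the score essentially unchanged, which is why the metric is called scale-invariant.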

Abstract

Single-channel speech separation in the time domain and the frequency domain has been widely studied for voice-driven applications over the past few years. Most previous works, however, assume that the number of speakers is known in advance, which is not easily obtained from a monaural mixture in practice. In this paper, we propose a novel model for single-channel multi-speaker separation that jointly learns the time-frequency feature and the unknown number of speakers. Specifically, our model integrates the time-domain convolution-encoded feature map and the frequency-domain spectrogram through an attention mechanism, and the integrated features are projected into high-dimensional embedding vectors, which are then clustered with a deep attractor network to modify the encoded feature. Meanwhile, the number of speakers is counted by computing the Gerschgorin disks of the embedding vectors which are orthogonal for…
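The abstract is cut off before it explains how the Gerschgorin disks are turned into a speaker count, so the sketch below only illustrates the underlying idea. By the Gerschgorin circle theorem, every eigenvalue of a square matrix lies in at least one disk centered at a diagonal entry, with radius equal to the sum of the absolute off-diagonal entries in that row. The function names, the toy Gram matrix, and the `threshold` counting rule are hypothetical choices made for illustration, assuming NumPy, and should not be read as the authors' procedure.

```python
import numpy as np


def gerschgorin_disks(matrix):
    """Return (centers, radii) of the row-wise Gerschgorin disks of a square matrix."""
    a = np.asarray(matrix, dtype=float)
    centers = np.diag(a).copy()
    # Radius = sum of absolute off-diagonal entries in each row.
    radii = np.sum(np.abs(a), axis=1) - np.abs(centers)
    return centers, radii


def count_dominant_disks(matrix, threshold=0.5):
    """Hypothetical heuristic: count disks whose lower edge stays well above zero.

    A disk counts as dominant when center - radius exceeds
    threshold * (largest center). This rule is an assumption for illustration.
    """
    centers, radii = gerschgorin_disks(matrix)
    lower_edges = centers - radii
    return int(np.sum(lower_edges > threshold * centers.max()))


# Toy Gram matrix (assumption): two strong, nearly orthogonal embedding
# directions and two weak ones, as if two speakers were active.
toy_gram = np.array([
    [3.0, 0.2, 0.1, 0.0],
    [0.2, 3.0, 0.0, 0.1],
    [0.1, 0.0, 0.1, 0.0],
    [0.0, 0.1, 0.0, 0.1],
])

centers, radii = gerschgorin_disks(toy_gram)
print(centers, radii)
print(count_dominant_disks(toy_gram))
```

In this toy case the two large disks sit far from the origin while the two small ones touch it, which is the kind of separation a disk-based counter relies on. A real system would need a principled threshold or a learned decision rule.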

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis