Continuous Speech Separation Using Speaker Inventory for Long Multi-talker Recording

Cong Han; Yi Luo; Chenda Li; Tianyan Zhou; Keisuke Kinoshita; Shinji Watanabe; Marc Delcroix; Hakan Erdogan; John R. Hershey; Nima Mesgarani; Zhuo Chen

arXiv:2012.09727 · eess.AS · December 21, 2020 · 6 cites

Open Access

TL;DR

This paper introduces a clustering-based speaker inventory method for long multi-talker recordings. It builds speaker models directly from the input recording, without external enrollment signals, and improves speech separation, particularly in noisy, reverberant environments.

Contribution

It presents a novel self-informed, clustering-based approach to form speaker inventories from long recordings, eliminating the need for pre-enrolled speaker signals.

Findings

01. Significant improvement in separation performance across various conditions.

02. Effective in noisy, reverberant long recordings.

03. Robust speaker embedding extraction from non-overlapped regions.

Abstract

Leveraging additional speaker information to facilitate speech separation has received increasing attention in recent years. Recent research includes extracting target speech by using the target speaker's voice snippet and jointly separating all participating speakers by using a pool of additional speaker signals, which is known as speech separation using speaker inventory (SSUSI). However, all these systems assume the ideal case in which pre-enrolled speaker signals are available, and they have been evaluated only on simple data configurations. In realistic multi-talker conversations, the speech signal contains a large proportion of non-overlapped regions, from which we can derive robust speaker embeddings for individual talkers. In this work, we adopt the SSUSI model for long recordings and propose a self-informed, clustering-based inventory forming scheme for long recordings, where the speaker inventory is…
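The inventory-forming idea in the abstract (cluster embeddings taken from the recording itself and use each cluster as a speaker profile) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the function name `form_speaker_inventory`, the use of plain k-means with farthest-point initialisation, a known number of speakers, and the synthetic 3-dimensional "embeddings". In practice the embeddings would come from a speaker embedding model applied to short windows of the recording.

```python
import numpy as np


def form_speaker_inventory(embeddings, num_speakers, num_iters=20, seed=0):
    """Cluster window-level embeddings; return one profile per cluster.

    embeddings: array of shape (num_windows, dim), one hypothetical
    embedding per short window of the recording.
    Returns (profiles, labels); profiles has shape (num_speakers, dim).
    """
    x = np.asarray(embeddings, dtype=float)
    # Length-normalise so distances depend on direction, not magnitude.
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    x = x / np.maximum(norms, 1e-12)

    # Farthest-point initialisation: start from a random window, then
    # repeatedly pick the window farthest from all chosen centroids.
    rng = np.random.default_rng(seed)
    centroids = [x[rng.integers(len(x))]]
    for _ in range(1, num_speakers):
        dists = np.min(
            [np.linalg.norm(x - c, axis=1) for c in centroids], axis=0
        )
        centroids.append(x[np.argmax(dists)])
    centroids = np.stack(centroids)

    def assign(points, centers):
        # Index of the nearest centroid for every point.
        d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        return np.argmin(d, axis=1)

    # Standard k-means: alternate assignment and centroid update.
    for _ in range(num_iters):
        labels = assign(x, centroids)
        for k in range(num_speakers):
            members = x[labels == k]
            if len(members) > 0:
                centroids[k] = members.mean(axis=0)

    return centroids, assign(x, centroids)


# Synthetic stand-in for embeddings of two talkers (illustration only).
rng = np.random.default_rng(1)
spk_a = np.array([1.0, 0.0, 0.0]) + 0.05 * rng.standard_normal((30, 3))
spk_b = np.array([0.0, 1.0, 0.0]) + 0.05 * rng.standard_normal((30, 3))
profiles, labels = form_speaker_inventory(
    np.vstack([spk_a, spk_b]), num_speakers=2
)
print(profiles.shape, labels[:5], labels[-5:])
```

The rows of `profiles` play the role of the speaker inventory that a separation model could then be conditioned on. Overlapped regions, an unknown speaker count, and noisy embeddings are not handled in this sketch.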

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing