Multi-stage Speaker Extraction with Utterance and Frame-Level Reference Signals
Meng Ge, Chenglin Xu, Longbiao Wang, Eng Siong Chng, Jianwu Dang, Haizhou Li

TL;DR
This paper introduces a multi-stage speaker extraction method that makes the most of a short reference sample by also using frame-level embeddings, and reports better performance than existing techniques in both clean and noisy conditions.
Contribution
It presents a novel multi-stage extraction framework using frame-level references and a signal fusion scheme, advancing speaker extraction capabilities with short reference samples.
Findings
Outperforms state-of-the-art baselines on WSJ0-2mix, WHAM!, and WHAMR! datasets.
Effective use of frame-level embeddings enhances speaker extraction accuracy.
Signal fusion improves the quality of extracted speech across multiple scales.
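The fusion idea in the last finding can be illustrated with a minimal sketch. The source only says the multi-scale decoded signals are combined with automatically learned weights; the softmax normalization, the function names, and the example values below are assumptions made for illustration, not the authors' implementation.

```python
import math


def softmax(logits):
    """Turn unconstrained scores into positive weights that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_signals(decoded, logits):
    """Weighted sum of equally long signals, one signal per decoder scale.

    `logits` stands in for learnable parameters; using a softmax to
    normalize them is an assumption, not something the source specifies.
    """
    if len(decoded) != len(logits):
        raise ValueError("need one logit per decoded signal")
    length = len(decoded[0])
    if any(len(sig) != length for sig in decoded):
        raise ValueError("decoded signals must share one length")
    weights = softmax(logits)
    return [
        sum(w * sig[t] for w, sig in zip(weights, decoded))
        for t in range(length)
    ]


# Hypothetical example: three scales with equal logits reduce to a plain average.
decoded = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
fused = fuse_signals(decoded, [0.0, 0.0, 0.0])
```

In a trained system the logits would be updated together with the rest of the network, so the model itself decides how much each scale contributes.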
Abstract
Speaker extraction requires a speech sample from the target speaker as the reference. However, enrolling a speaker with a long speech sample is not practical. We propose a speaker extraction technique that operates in multiple stages to take full advantage of a short reference speech sample. The speech extracted in early stages is used as the reference speech for later stages. For the first time, we use a frame-level sequential speech embedding as the reference for the target speaker. This is a departure from the traditional utterance-based speaker embedding reference. In addition, a signal fusion scheme is proposed to combine the decoded signals at multiple scales with automatically learned weights. Experiments on WSJ0-2mix and its noisy versions (WHAM! and WHAMR!) show that the proposed system, SpEx++, consistently outperforms other state-of-the-art baselines.
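The multi-stage control flow described in the abstract can be sketched as a simple loop in which each stage's output becomes the next stage's reference. Everything named here (`multi_stage_extract`, `toy_extract`, `num_stages`) is hypothetical, and the toy extractor is a placeholder that does no real separation. The abstract does not say whether the original enrollment sample is still used in later stages, so this sketch simply replaces the reference.

```python
def multi_stage_extract(mixture, enrollment, extract, num_stages=2):
    """Run `extract` repeatedly, feeding each estimate back as the reference.

    `extract(mixture, reference)` is any callable returning an estimate of
    the target speech. Stage 1 sees the short enrollment sample; later
    stages see the previous estimate, which is as long as the mixture.
    """
    reference = enrollment
    outputs = []
    for _ in range(num_stages):
        estimate = extract(mixture, reference)
        outputs.append(estimate)
        reference = estimate  # assumption: the estimate replaces the reference
    return outputs


def toy_extract(mixture, reference):
    """Placeholder only: scales the mixture by the mean of the reference."""
    gain = sum(reference) / len(reference)
    return [gain * m for m in mixture]


# Hypothetical values chosen so the arithmetic is easy to follow.
mixture = [1.0, 2.0, 3.0]
enrollment = [0.5, 0.5]
stages = multi_stage_extract(mixture, enrollment, toy_extract, num_stages=2)
```

The point of the loop is that the reference available to a later stage covers the whole utterance, whereas the first stage only has the short enrollment sample.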
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing
