L-SpEx: Localized Target Speaker Extraction

Meng Ge; Chenglin Xu; Longbiao Wang; Eng Siong Chng; Jianwu Dang; Haizhou Li

arXiv:2202.09995 · eess.AS · February 22, 2022

TL;DR

L-SpEx is an end-to-end speaker extraction method that both localizes and extracts a target speaker from speech cues alone, with no visual input. It combines spatial features with an attention mechanism and improves extraction in reverberant multi-channel environments.

Contribution

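The paper presents an end-to-end localized speaker extraction model that does not rely on visual cues. A speaker localizer, driven by the target speaker's embedding, supplies spatial features (direction-of-arrival and beamforming output), which are combined with the embedding to focus attention on the target speaker.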

Findings

01 Outperforms baseline systems on the MC-Libri2Mix dataset.

02 Effectively estimates direction-of-arrival (DOA) and beamforming output.

03 Enhances target speaker extraction in reverberant multi-channel settings.
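For readers unfamiliar with the term, the sketch below shows what a direction-of-arrival estimate is in the simplest case: two microphones and a classical cross-correlation delay search. It is not the paper's learned, embedding-driven localizer. The function name `estimate_doa_two_mics`, the 343 m/s speed of sound, and the far-field, single-source setting are all assumptions made for illustration.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value


def estimate_doa_two_mics(x0, x1, fs, mic_distance):
    """Classical two-microphone DOA estimate (illustrative, hypothetical helper).

    Returns the angle in degrees (0..180) between the source direction and
    the axis pointing from microphone 1 towards microphone 0, under a
    far-field, single-source assumption.
    """
    x0 = np.asarray(x0, dtype=float)
    x1 = np.asarray(x1, dtype=float)
    n = len(x0)
    # corr[k] peaks at the lag (in samples) by which x1 trails x0.
    corr = np.correlate(x1, x0, mode="full")
    lags = np.arange(-(n - 1), n)
    # Only lags that are physically possible for this microphone spacing.
    max_lag = int(np.floor(mic_distance / SPEED_OF_SOUND * fs))
    valid = np.abs(lags) <= max_lag
    best_lag = lags[valid][np.argmax(corr[valid])]
    # Convert the delay to an angle: delay = mic_distance * cos(theta) / c.
    delay_seconds = best_lag / fs
    cos_theta = np.clip(delay_seconds * SPEED_OF_SOUND / mic_distance, -1.0, 1.0)
    return float(np.degrees(np.arccos(cos_theta)))
```

A delay of zero maps to 90 degrees (broadside), and the largest admissible positive delay maps to an angle near 0 degrees. The angular resolution is limited by the sampling rate and the microphone spacing, which is one reason practical systems work with finer-grained spatial features.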

Abstract

Speaker extraction aims to extract the target speaker's voice from a multi-talker speech mixture given an auxiliary reference utterance. Recent studies show that speaker extraction benefits from the location or direction of the target speaker. However, these studies assume that the target speaker's location is known in advance or detected by an extra visual cue, e.g., face image or video. In this paper, we propose an end-to-end localized target speaker extraction on pure speech cues, that is called L-SpEx. Specifically, we design a speaker localizer driven by the target speaker's embedding to extract the spatial features, including direction-of-arrival (DOA) of the target speaker and beamforming output. Then, the spatial cues and target speaker's embedding are both used to form a top-down auditory attention to the target speaker. Experiments on the multi-channel reverberant dataset…
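To illustrate how a direction estimate can yield a "beamforming output", here is a minimal delay-and-sum beamformer for a linear array. It is a textbook technique offered as a sketch only; the abstract does not say which beamformer L-SpEx uses. The name `delay_and_sum`, the far-field plane-wave model, the 343 m/s speed of sound, and the circular (FFT-based) shifting are assumptions.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed


def delay_and_sum(signals, mic_positions, doa_deg, fs):
    """Steer a linear array towards doa_deg and average the aligned channels.

    signals:       array of shape (num_mics, num_samples)
    mic_positions: microphone coordinates in metres along the array axis
    doa_deg:       angle between the source direction and the array axis
    fs:            sampling rate in Hz
    (Hypothetical helper; shifts are circular because they are applied
    as phase rotations in the FFT domain.)
    """
    signals = np.asarray(signals, dtype=float)
    positions = np.asarray(mic_positions, dtype=float)
    num_samples = signals.shape[1]
    # Far-field model: a microphone at position p receives the wavefront
    # p * cos(theta) / c seconds earlier than a microphone at the origin.
    delays = -positions * np.cos(np.radians(doa_deg)) / SPEED_OF_SOUND
    freqs = np.fft.rfftfreq(num_samples, d=1.0 / fs)
    spectra = np.fft.rfft(signals, axis=1)
    # A delay of tau multiplies the spectrum by exp(-2j*pi*f*tau);
    # multiplying by exp(+2j*pi*f*tau) undoes it and aligns the channels.
    steering = np.exp(2j * np.pi * freqs[None, :] * delays[:, None])
    aligned = spectra * steering
    return np.fft.irfft(aligned.mean(axis=0), n=num_samples)
```

When the steering angle matches the true source direction, the channels add coherently and the source is preserved, while sound from other directions adds with mismatched phases and is attenuated. This is why a reliable direction estimate for the target speaker is useful as a spatial cue.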

Code & Models

Repositories

gemengtju/l-spex (PyTorch, official)

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing