Hierarchical speaker representation for target speaker extraction

Shulin He; Huaiwen Zhang; Wei Rao; Kanghao Zhang; Yukai Ju; Yang Yang; Xueliang Zhang

arXiv:2210.15849·cs.SD·January 8, 2024

Open Access

TL;DR

This paper introduces Hierarchical Representation (HR), a method that fuses anchor data across multiple layers of the separation network to improve target speaker extraction. HR outperforms state-of-the-art time-frequency domain methods on Libri-2talker and placed first in the ICASSP 2023 Deep Noise Suppression Challenge.

Contribution

The paper proposes a hierarchical fusion approach for speaker embeddings, enhancing target speaker extraction beyond traditional simple vector representations.

Findings

01

HR outperforms state-of-the-art time-frequency domain methods on Libri-2talker.

02

HR achieved first place in the ICASSP 2023 Deep Noise Suppression Challenge.

03

Hierarchical fusion improves anchor utilization for better speaker isolation.

Abstract

Target speaker extraction aims to isolate a specific speaker's voice from a mixture of multiple sound sources, guided by an enrollment utterance, also called an anchor. Current methods predominantly derive a speaker embedding from the anchor and integrate it into the separation network to separate the voice of the target speaker. However, this representation is too simplistic, often being merely a 1×1024 vector, and packing the speaker information so densely makes it difficult for the separation network to exploit effectively. To address this limitation, we introduce Hierarchical Representation (HR), a method that fuses anchor data across 5 layers of the separation network, both granular and overarching, enhancing the precision of target extraction. HR makes more effective use of the anchor to improve target speaker isolation. On the Libri-2talker dataset, HR…
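The contrast drawn in the abstract, a single embedding vector injected once versus anchor information fused at each of 5 layers, can be illustrated with a toy NumPy sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the layer form (linear projection plus tanh), mean pooling over time, multiplicative fusion, the small feature size `FEAT_DIM`, and all function names are hypothetical. The abstract does not specify how HR actually performs the fusion.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_LAYERS = 5   # the abstract mentions 5 layers
FEAT_DIM = 16    # hypothetical toy size (the abstract cites 1x1024 embeddings)


def layer(x, w):
    """One toy layer: linear projection followed by tanh (an assumption)."""
    return np.tanh(x @ w)


def encode_anchor(anchor, weights):
    """Return one time-averaged anchor summary per layer."""
    summaries = []
    h = anchor
    for w in weights:
        h = layer(h, w)
        summaries.append(h.mean(axis=0))  # shape (FEAT_DIM,)
    return summaries


def separate_single_embedding(mixture, embedding, weights):
    """Baseline sketch: one speaker vector is fused once, at the input."""
    h = mixture * embedding  # broadcast over time frames
    for w in weights:
        h = layer(h, w)
    return h


def separate_hierarchical(mixture, summaries, weights):
    """Hierarchical sketch: a matching anchor summary is fused at every layer."""
    h = mixture
    for w, s in zip(weights, summaries):
        h = layer(h, w)
        h = h * s  # multiplicative fusion is an assumption, not the paper's operator
    return h


# Random stand-ins for feature frames: (time, feature)
mixture = rng.standard_normal((40, FEAT_DIM))
anchor = rng.standard_normal((25, FEAT_DIM))
anchor_weights = [0.5 * rng.standard_normal((FEAT_DIM, FEAT_DIM)) for _ in range(NUM_LAYERS)]
sep_weights = [0.5 * rng.standard_normal((FEAT_DIM, FEAT_DIM)) for _ in range(NUM_LAYERS)]

anchor_summaries = encode_anchor(anchor, anchor_weights)
single_out = separate_single_embedding(mixture, anchor_summaries[-1], sep_weights)
hier_out = separate_hierarchical(mixture, anchor_summaries, sep_weights)
print(single_out.shape, hier_out.shape)
```

In the baseline, the separation layers see the anchor only through the final summary vector; in the hierarchical variant, each layer receives its own anchor summary, which is the general idea the abstract describes.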

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing