Hierarchical speaker representation for target speaker extraction
Shulin He, Huaiwen Zhang, Wei Rao, Kanghao Zhang, Yukai Ju, Yang Yang, Xueliang Zhang

TL;DR
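This paper introduces Hierarchical Representation (HR), a method that fuses anchor (enrollment utterance) data across multiple layers of the separation network to improve target speaker extraction. HR outperforms state-of-the-art time-frequency domain methods on Libri-2talker and took first place in the ICASSP 2023 Deep Noise Suppression Challenge.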
Contribution
The paper proposes a hierarchical fusion approach for speaker embeddings, enhancing target speaker extraction beyond traditional simple vector representations.
Findings
HR outperforms state-of-the-art time-frequency domain methods on Libri-2talker.
HR achieved first place in the ICASSP 2023 Deep Noise Suppression Challenge.
Hierarchical fusion improves anchor utilization for better speaker isolation.
Abstract
Target speaker extraction aims to isolate a specific speaker's voice from a mixture of multiple sound sources, guided by an enrollment utterance, also called an anchor. Current methods predominantly derive a speaker embedding from the anchor and feed it into the separation network to extract the target speaker's voice. However, this representation is too simplistic, often merely a 1×1024 vector. Information packed this densely is difficult for the separation network to exploit effectively. To address this limitation, we introduce Hierarchical Representation (HR), a method that fuses anchor data across five layers of the separation network, both granular and overarching, improving the precision of target extraction. HR makes more effective use of the anchor to improve target speaker isolation. On the Libri-2talker dataset, HR…
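The contrast the abstract draws, between injecting one pooled speaker vector and fusing anchor information at several depths, can be illustrated with a toy sketch. This is not the paper's architecture: `toy_layer`, `fuse`, the additive fusion rule, the feature dimension, and the layer weights are all hypothetical stand-ins chosen only to make the two conditioning patterns concrete.

```python
import random

def mean_pool(frames):
    """Average a list of frames (each a list of floats) into one vector."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]

def toy_layer(frames, weight):
    """Hypothetical stand-in for one network layer: scale, then ReLU."""
    return [[max(0.0, weight * x) for x in frame] for frame in frames]

def fuse(frames, summary):
    """Add a per-channel anchor summary to every mixture frame (assumed fusion rule)."""
    return [[x + s for x, s in zip(frame, summary)] for frame in frames]

def single_vector_conditioning(mixture, anchor, weights):
    """Baseline pattern: pool the anchor once and inject it only at the input."""
    h = fuse(mixture, mean_pool(anchor))
    for w in weights:
        h = toy_layer(h, w)
    return h

def hierarchical_conditioning(mixture, anchor, weights):
    """Hierarchical pattern: fuse a layer-specific anchor summary at every depth."""
    h, a = mixture, anchor
    for w in weights:
        h = fuse(h, mean_pool(a))  # inject anchor features computed at this depth
        h = toy_layer(h, w)
        a = toy_layer(a, w)        # the anchor passes through the same toy layer
    return h

random.seed(0)
dim = 4  # illustrative feature size, far smaller than a real embedding
mixture = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(6)]
anchor = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(3)]
weights = [0.5, 1.0, 1.5, 1.0, 0.5]  # illustrative layer count and values

out_single = single_vector_conditioning(mixture, anchor, weights)
out_hier = hierarchical_conditioning(mixture, anchor, weights)
print(len(out_hier), len(out_hier[0]))  # frames and channels are preserved: 6 4
```

In the baseline, the anchor influences the network only through one pooled vector at the input; in the hierarchical variant, each layer sees an anchor summary computed at its own depth. A real system would use learned layers and a learned fusion operation rather than the fixed arithmetic assumed here.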
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing
