X-CrossNet: A complex spectral mapping approach to target speaker extraction with cross attention speaker embedding fusion

Chang Sun; Bo Qin

arXiv:2411.13811 · cs.SD · November 26, 2024

Open Access

TL;DR

X-CrossNet is a complex spectral mapping model for target speaker extraction that fuses the speaker embedding through cross-attention, aiming for better robustness in noisy and reverberant environments than existing methods.

Contribution

The paper presents X-CrossNet, a new TSE model that combines CrossNet with a cross-attention mechanism to enhance feature integration and performance in challenging acoustic conditions.
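To make the idea of cross-attention fusion concrete, here is a minimal NumPy sketch of generic scaled dot-product cross-attention between mixture features and speaker-embedding features. It is an illustration under stated assumptions, not the architecture from the paper: using mixture frames as queries and speaker frames as keys and values, the residual connection, the random weights, and every name (`cross_attention_fusion`, `mix_feats`, `spk_feats`, `w_q`, `w_k`, `w_v`) are hypothetical choices made for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_fusion(mix_feats, spk_feats, w_q, w_k, w_v):
    """Fuse speaker information into mixture features (illustrative only).

    mix_feats: (T, D) features of the mixed speech, one row per frame.
    spk_feats: (S, D) features derived from the target-speaker enrollment.
    w_q, w_k:  (D, d) projection matrices; w_v: (D, D).
    """
    q = mix_feats @ w_q                       # (T, d) queries from the mixture
    k = spk_feats @ w_k                       # (S, d) keys from the speaker
    v = spk_feats @ w_v                       # (S, D) values from the speaker
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (T, S) scaled dot products
    attn = softmax(scores, axis=-1)           # each mixture frame attends over speaker frames
    fused = mix_feats + attn @ v              # assumed residual connection
    return fused, attn

# Toy shapes and random weights, chosen only for demonstration.
rng = np.random.default_rng(0)
T, S, D, d = 6, 4, 8, 8
mix = rng.standard_normal((T, D))
spk = rng.standard_normal((S, D))
w_q = rng.standard_normal((D, d))
w_k = rng.standard_normal((D, d))
w_v = rng.standard_normal((D, D))

fused, attn = cross_attention_fusion(mix, spk, w_q, w_k, w_v)
print(fused.shape, attn.shape)
```

In this sketch every mixture frame receives its own weighted summary of the speaker features, in contrast to simpler fusion schemes that concatenate or multiply in a single fixed speaker vector at every frame.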

Findings

01. Outperforms existing models on the WSJ0-2mix and WHAMR! datasets.

02. Demonstrates robustness in noisy and reverberant environments.

03. Achieves state-of-the-art results in target speaker extraction.

Abstract

Target speaker extraction (TSE) is a technique for isolating a target speaker's voice from mixed speech using auxiliary features associated with the target speaker. It is another attempt at addressing the cocktail party problem and is generally considered to have more practical application prospects than traditional speech separation methods. Although academic research in this area has achieved high performance and evaluation scores on public datasets, most models exhibit significantly reduced performance in real-world noisy or reverberant conditions. To address this limitation, we propose a novel TSE model, X-CrossNet, which leverages CrossNet as its backbone. CrossNet is a speech separation network specifically optimized for challenging noisy and reverberant environments, achieving state-of-the-art performance in tasks such as speaker separation under these conditions. Additionally,…

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing