Speaker-conditioning Single-channel Target Speaker Extraction using Conformer-based Architectures

Ragini Sinha; Marvin Tammen; Christian Rollwage; Simon Doclo

arXiv:2205.13851 · eess.AS · May 30, 2022 · IWAENC

Open Access

TL;DR

This paper introduces two conformer-based separator architectures for single-channel target speaker extraction, Conformer-FFN and TCN-Conformer, and reports that TCN-Conformer significantly outperforms the other models on 2-speaker, 3-speaker, and noisy 2-speaker mixtures.

Contribution

The paper proposes two conformer-based speaker separator architectures for target speaker extraction, Conformer-FFN and TCN-Conformer, each trained jointly with a speaker embedder network in an end-to-end process to improve extraction from multi-speaker and noisy mixtures.

Findings

01 TCN-Conformer outperforms the other models in speaker extraction performance

02 End-to-end training of the embedder and separator enhances system performance

03 Effective on 2-speaker mixtures, 3-speaker mixtures, and noisy 2-speaker mixtures

Abstract

Target speaker extraction aims at extracting the target speaker from a mixture of multiple speakers by exploiting auxiliary information about the target speaker. In this paper, we consider a complete time-domain target speaker extraction system consisting of a speaker embedder network and a speaker separator network, which are jointly trained in an end-to-end learning process. We propose two different architectures for the speaker separator network, both based on the convolution-augmented transformer (conformer). The first architecture uses stacks of conformer and external feed-forward blocks (Conformer-FFN), while the second architecture uses stacks of temporal convolutional network (TCN) and conformer blocks (TCN-Conformer). Experimental results for 2-speaker mixtures, 3-speaker mixtures, and noisy mixtures of 2 speakers show that among the proposed separator networks, the…
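
The two separator layouts named in the abstract can be pictured as repeating block patterns. The following is only a minimal structural sketch in plain Python, not the authors' implementation: the class `SeparatorLayout`, the helper `describe_system`, the block ordering inside a stack, and the stack count of 2 are all hypothetical placeholders, since the abstract does not specify them.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class SeparatorLayout:
    """Hypothetical description of a separator as repeated stacks of blocks."""
    name: str
    stack_pattern: Tuple[str, ...]  # block types inside one stack, in order
    num_stacks: int                 # ASSUMPTION: not given in the abstract

    def blocks(self) -> List[str]:
        """Flatten all stacks into one ordered list of block labels."""
        return [
            f"stack{s}.{block}"
            for s in range(1, self.num_stacks + 1)
            for block in self.stack_pattern
        ]


# ASSUMPTION: the order within a stack and num_stacks=2 are placeholders.
CONFORMER_FFN = SeparatorLayout("Conformer-FFN", ("conformer", "ffn"), 2)
TCN_CONFORMER = SeparatorLayout("TCN-Conformer", ("tcn", "conformer"), 2)


def describe_system(layout: SeparatorLayout) -> List[str]:
    """List the modules a joint (end-to-end) training step would update:
    the speaker embedder plus every block of the chosen separator."""
    return ["speaker_embedder"] + [f"separator.{b}" for b in layout.blocks()]


if __name__ == "__main__":
    for layout in (CONFORMER_FFN, TCN_CONFORMER):
        print(layout.name, describe_system(layout))
```

The point of `describe_system` is that the speaker embedder appears in the same module list as the separator blocks, which mirrors the joint end-to-end training described above; in a real system each label would be a trainable network module.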

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing