Speaker-conditioning Single-channel Target Speaker Extraction using Conformer-based Architectures
Ragini Sinha, Marvin Tammen, Christian Rollwage, Simon Doclo

TL;DR
This paper introduces two conformer-based architectures for single-channel target speaker extraction and reports that the TCN-Conformer significantly outperforms other models on 2-speaker mixtures, 3-speaker mixtures, and noisy 2-speaker mixtures.
Contribution
The paper proposes two conformer-based speaker separator architectures, Conformer-FFN and TCN-Conformer, for a time-domain target speaker extraction system in which the speaker embedder and the speaker separator are trained jointly end-to-end.
Findings
TCN-Conformer outperforms other models in speaker extraction accuracy
End-to-end training enhances system performance
Effective on 2-speaker mixtures, 3-speaker mixtures, and noisy 2-speaker mixtures
Abstract
Target speaker extraction aims at extracting the target speaker from a mixture of multiple speakers by exploiting auxiliary information about the target speaker. In this paper, we consider a complete time-domain target speaker extraction system consisting of a speaker embedder network and a speaker separator network, which are jointly trained in an end-to-end learning process. We propose two different architectures for the speaker separator network, both based on the convolution-augmented transformer (conformer). The first architecture uses stacks of conformer and external feed-forward blocks (Conformer-FFN), while the second architecture uses stacks of temporal convolutional network (TCN) and conformer blocks (TCN-Conformer). Experimental results for 2-speaker mixtures, 3-speaker mixtures, and noisy mixtures of 2 speakers show that among the proposed separator networks, the…
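The data flow described in the abstract (a speaker embedder that turns a reference utterance into a fixed-size vector, and a separator that processes the time-domain mixture conditioned on that vector) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names `embed_speaker`, `separate`, and `extract`, the mean-pooling embedder, and the sigmoid mask are toy stand-ins, not the conformer or TCN blocks of the paper. No training is shown; in a jointly trained system, a single loss on the output would update both parts.

```python
import math
import random


def embed_speaker(reference, frame_len=4):
    """Toy embedder (hypothetical): mean-pool fixed-length frames of the
    reference waveform into one vector of length frame_len."""
    frames = [reference[i:i + frame_len]
              for i in range(0, len(reference) - frame_len + 1, frame_len)]
    return [sum(f[k] for f in frames) / len(frames) for k in range(frame_len)]


def separate(mixture, embedding, weights):
    """Toy speaker-conditioned separator (hypothetical): the embedding is
    projected to a scalar bias that shifts a per-sample sigmoid mask."""
    bias = sum(w * e for w, e in zip(weights, embedding))
    estimate = []
    for x in mixture:
        mask = 1.0 / (1.0 + math.exp(-(x + bias)))  # mask lies in (0, 1)
        estimate.append(mask * x)
    return estimate


def extract(mixture, reference, weights):
    """Embedder followed by separator, operating on raw waveform samples."""
    embedding = embed_speaker(reference, frame_len=len(weights))
    return separate(mixture, embedding, weights)


random.seed(0)
mixture = [random.uniform(-1, 1) for _ in range(16)]    # stand-in mixture
reference = [random.uniform(-1, 1) for _ in range(12)]  # stand-in enrollment
weights = [0.5, -0.25, 0.1, 0.3]                        # arbitrary projection
estimate = extract(mixture, reference, weights)
```

The point of the sketch is the interface: the embedding has a fixed size regardless of the reference length, and the separator output has the same length as the input mixture.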
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
