Speaker-conditioned Target Speaker Extraction based on Customized LSTM Cells

Ragini Sinha (1), Marvin Tammen (2), Christian Rollwage (1), Simon Doclo (1, 2) ((1) Fraunhofer Institute for Digital Media Technology, Project group Hearing, Speech and Audio Technology, Oldenburg, Germany; (2) Dept. of Medical Physics and Acoustics and Cluster of Excellence Hearing4all, University of Oldenburg, Germany)

arXiv:2104.04234 · eess.AS · April 12, 2021 · ITG Conference on Speech Communication · 1 cites

TL;DR

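This paper customizes the LSTM cells in the CNN-LSTM separator network of a single-channel target speaker extraction system. The forget gate's information processing is modified so that the cells only remember voice patterns of the target speaker, which significantly improves extraction performance on two-speaker Librispeech mixtures.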

Contribution

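The paper proposes customizing the LSTM cells of the separator network, by modifying the information processing in the forget gate, so that they only remember the voice patterns of the target speaker. This improves extraction performance over standard LSTM cells.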

Findings

01. Customized LSTM cells outperform standard LSTM in speaker extraction tasks.
02. Significant performance improvement on Librispeech two-speaker mixtures.
03. Effective focus on target speaker voice patterns enhances extraction accuracy.

Abstract

Speaker-conditioned target speaker extraction systems rely on auxiliary information about the target speaker to extract the target speaker signal from a mixture of multiple speakers. Typically, a deep neural network is applied to isolate the relevant target speaker characteristics. In this paper, we focus on a single-channel target speaker extraction system based on a CNN-LSTM separator network and a speaker embedder network requiring reference speech of the target speaker. In the LSTM layer of the separator network, we propose to customize the LSTM cells in order to only remember the specific voice patterns corresponding to the target speaker by modifying the information processing in the forget gate. Experimental results for two-speaker mixtures using the Librispeech dataset show that this customization significantly improves the target speaker extraction performance compared to using…
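
To make the forget-gate idea concrete, here is a minimal pure-Python sketch of one LSTM time step. The abstract above is truncated and does not give the exact modification, so the speaker-conditioned branch is only an assumed illustration, not the authors' formulation: it computes the forget gate from the previous hidden state and a speaker embedding instead of the current input. All names (`lstm_step`, `spk_emb`, `forget_spk`, the toy dimensions) are hypothetical.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def affine(W, v, b):
    """Return W @ v + b for a list-of-rows matrix W."""
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]

def gate(wb, v, act):
    """Apply an affine map followed by an elementwise activation."""
    W, b = wb
    return [act(a) for a in affine(W, v, b)]

def init_gate(n_out, n_in, rng):
    """Small random weights and zero bias for one gate."""
    W = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out

def lstm_step(x, h_prev, c_prev, params, spk_emb=None):
    """One LSTM time step.

    spk_emb is None: standard cell, every gate sees [h_prev, x].
    spk_emb given:   ASSUMED variant, the forget gate sees [h_prev, spk_emb],
                     so what is kept in memory depends on the target speaker.
    """
    z = h_prev + x  # list concatenation, i.e. [h_prev, x]
    if spk_emb is None:
        f = gate(params["forget"], z, sigmoid)
    else:
        f = gate(params["forget_spk"], h_prev + spk_emb, sigmoid)
    i = gate(params["input"], z, sigmoid)    # input gate
    g = gate(params["cand"], z, math.tanh)   # candidate cell state
    o = gate(params["output"], z, sigmoid)   # output gate
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c

# Toy dimensions chosen only for illustration.
rng = random.Random(0)
n_in, n_hid, n_emb = 4, 3, 2
params = {name: init_gate(n_hid, n_hid + n_in, rng)
          for name in ("forget", "input", "cand", "output")}
params["forget_spk"] = init_gate(n_hid, n_hid + n_emb, rng)

x = [0.5, -0.2, 0.1, 0.3]   # one frame of mixture features
h0 = [0.0] * n_hid
c0 = [0.5, -0.5, 0.2]       # non-zero memory so the forget gate matters

h_std, c_std = lstm_step(x, h0, c0, params)
h_spk, c_spk = lstm_step(x, h0, c0, params, spk_emb=[1.0, -1.0])
```

In this sketch only the forget gate differs between the two calls, so any difference between `c_std` and `c_spk` comes from how much of the previous cell state is retained. The input, candidate, and output paths are those of a standard LSTM cell.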

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Tanh Activation · Sigmoid Activation · Long Short-Term Memory