Utterance-level Permutation Invariant Training with Latency-controlled BLSTM for Single-channel Multi-talker Speech Separation

Lu Huang; Gaofeng Cheng; Pengyuan Zhang; Yi Yang; Shumin Xu; Jiasong Sun

arXiv:1912.11613·cs.SD·December 30, 2019·1 cites

TL;DR

This paper proposes running a latency-controlled BLSTM (LC-BLSTM) at inference time for single-channel multi-talker speech separation, aiming for low latency while keeping separation quality close to that of a full BLSTM, and compares chunk-level and utterance-level PIT as training strategies.

Contribution

The paper proposes using a latency-controlled BLSTM (LC-BLSTM) during inference and compares chunk-level PIT (cPIT) with utterance-level PIT (uPIT) as training strategies for the BLSTM-based separation network.
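The difference between the two training strategies can be illustrated with a toy sketch. It assumes each network output and each reference is a list with one number per frame (real systems use spectra or masks), that the loss is mean squared error, and that chunk-level PIT picks an assignment independently per chunk while utterance-level PIT picks one assignment for the whole utterance. All function names here are hypothetical and are not taken from the paper.

```python
from itertools import permutations

def mse(est, ref):
    """Mean squared error between two equal-length frame sequences."""
    return sum((e - r) ** 2 for e, r in zip(est, ref)) / len(ref)

def best_permutation(est_streams, ref_streams):
    """Return the output-to-speaker assignment with the lowest total MSE.

    perm[s] is the index of the network output assigned to speaker s.
    """
    n = len(ref_streams)

    def total_loss(perm):
        return sum(mse(est_streams[perm[s]], ref_streams[s]) for s in range(n))

    return min(permutations(range(n)), key=total_loss)

def upit_assignment(est_streams, ref_streams):
    """Utterance-level PIT: one assignment for the whole utterance."""
    return best_permutation(est_streams, ref_streams)

def cpit_assignments(est_streams, ref_streams, chunk_len):
    """Chunk-level PIT (as assumed here): one assignment per chunk."""
    total = len(ref_streams[0])
    perms = []
    for start in range(0, total, chunk_len):
        est_c = [s[start:start + chunk_len] for s in est_streams]
        ref_c = [s[start:start + chunk_len] for s in ref_streams]
        perms.append(best_permutation(est_c, ref_c))
    return perms

# Toy data: the two outputs swap speakers in the last two frames.
refs = [[1, 1, 1, 1, 1, 1], [0, 0, 0, 0, 0, 0]]
ests = [[1, 1, 1, 1, 0, 0], [0, 0, 0, 0, 1, 1]]
utt_perm = upit_assignment(ests, refs)
chunk_perms = cpit_assignments(ests, refs, chunk_len=2)
```

In this toy case the per-chunk assignment flips in the last chunk, whereas the utterance-level assignment stays fixed and the swap is penalised by the loss instead.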

Findings

1. uPIT outperforms cPIT with LC-BLSTM during inference.

2. Inter-chunk speaker tracing enhances uPIT-LC-BLSTM performance.

3. SDR gap between uPIT-BLSTM and uPIT-LC-BLSTM is within 0.7 dB.
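Inter-chunk speaker tracing can be pictured with the following minimal sketch. It rests on assumptions that this page does not confirm: that consecutive chunks share a few overlapping frames, and that the output order of each chunk is chosen to minimise squared error against the previous chunk on that overlap. The paper's actual tracing criterion may differ, and `trace_speakers` is a hypothetical name.

```python
from itertools import permutations

def sq_err(a, b):
    """Sum of squared differences between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def trace_speakers(chunks, overlap):
    """Reorder the streams of each chunk to stay consistent with the previous one.

    chunks:  list of chunks; each chunk is a list of per-output frame lists.
    overlap: number of frames assumed to be shared by the tail of one chunk
             and the head of the next (must be positive).
    """
    if overlap <= 0:
        raise ValueError("overlap must be positive")
    traced = [chunks[0]]
    for chunk in chunks[1:]:
        prev = traced[-1]
        n = len(chunk)

        def cost(perm):
            # Compare the tail of each already-traced stream with the head
            # of the candidate stream assigned to the same speaker slot.
            return sum(
                sq_err(prev[s][-overlap:], chunk[perm[s]][:overlap])
                for s in range(n)
            )

        best = min(permutations(range(n)), key=cost)
        traced.append([chunk[best[s]] for s in range(n)])
    return traced

# Toy data: the second chunk comes out with the two speakers swapped.
chunks = [
    [[1, 1, 1], [0, 0, 0]],
    [[0, 0, 0], [1, 1, 1]],
]
traced = trace_speakers(chunks, overlap=1)
```

After tracing, slot 0 holds the same speaker in both chunks, so the chunk outputs can be stitched together without a mid-utterance swap.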

Abstract

Utterance-level permutation invariant training (uPIT) has achieved promising progress on the single-channel multi-talker speech separation task. Long short-term memory (LSTM) and bidirectional LSTM (BLSTM) networks are widely used as the separation networks of uPIT, i.e. uPIT-LSTM and uPIT-BLSTM. uPIT-LSTM has lower latency but worse performance, while uPIT-BLSTM has better performance but higher latency. In this paper, we propose using a latency-controlled BLSTM (LC-BLSTM) during inference to achieve speech separation with both low latency and good performance. To find a better training strategy for the BLSTM-based separation network, chunk-level PIT (cPIT) and uPIT are compared. The experimental results show that uPIT outperforms cPIT when LC-BLSTM is used during inference. It is also found that inter-chunk speaker tracing (ST) can further improve the separation performance of uPIT-LC-BLSTM. Evaluated on the…
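The abstract does not spell out how LC-BLSTM processes an utterance. As commonly described in the speech recognition literature, the input is cut into main chunks of a fixed number of frames, each extended with a short block of future (right-context) frames that the backward direction can see. The sketch below only computes those chunk boundaries; the parameter names `main_len` and `right_len` and the function `lc_chunks` are illustrative assumptions, not values or names from the paper.

```python
def lc_chunks(num_frames, main_len, right_len):
    """Return (start, main_end, ctx_end) frame indices for each chunk.

    Frames [start, main_end) are the chunk's own outputs; frames
    [main_end, ctx_end) are look-ahead context only (assumed layout).
    """
    chunks = []
    for start in range(0, num_frames, main_len):
        main_end = min(start + main_len, num_frames)
        ctx_end = min(main_end + right_len, num_frames)
        chunks.append((start, main_end, ctx_end))
    return chunks

# A 10-frame utterance with 4-frame main chunks and 2 frames of look-ahead.
boundaries = lc_chunks(10, main_len=4, right_len=2)
```

Under this layout a chunk can be processed once `main_len + right_len` frames are available, rather than after the whole utterance as a full BLSTM requires, which is where the latency saving would come from.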

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing

Methods: Utterance-level Permutation Invariant Training · Sigmoid Activation · Tanh Activation · Long Short-Term Memory