Continuous Speech Separation with Recurrent Selective Attention Network
Yixuan Zhang, Zhuo Chen, Jian Wu, Takuya Yoshioka, Peidong Wang, Zhong Meng, Jinyu Li

TL;DR
This paper introduces RSAN-based continuous speech separation that adaptively determines the number of output channels and leverages previous separation results, leading to improved speech recognition accuracy over traditional fixed-channel methods.
Contribution
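The paper applies the recurrent selective attention network (RSAN) to CSS so that the number of output channels follows the number of active speakers, and proposes a block-wise dependency extension that passes separation results from previous processing blocks to the current one.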
Findings
RSAN-CSS improves speech recognition accuracy over PIT-based models on LibriCSS.
Block-wise dependency modeling further improves results.
Adaptive channel count reduces speech leakage and separation failures.
Abstract
While permutation invariant training (PIT) based continuous speech separation (CSS) significantly improves the conversation transcription accuracy, it often suffers from speech leakages and failures in separation at "hot spot" regions because it has a fixed number of output channels. In this paper, we propose to apply recurrent selective attention network (RSAN) to CSS, which generates a variable number of output channels based on active speaker counting. In addition, we propose a novel block-wise dependency extension of RSAN by introducing dependencies between adjacent processing blocks in the CSS framework. It enables the network to utilize the separation results from the previous blocks to facilitate the current block processing. Experimental results on the LibriCSS dataset show that the RSAN-based CSS (RSAN-CSS) network consistently improves the speech recognition accuracy over…
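The variable channel count described in the abstract can be pictured as a loop that extracts one source at a time until nothing is left to explain. The following is a minimal sketch of that idea, not the authors' implementation. It assumes the usual RSAN-style formulation in which a residual mask starts at all ones, each extracted mask is subtracted from it, and the loop stops once the mean of the residual falls below a threshold. The names `rsan_style_separate` and `make_toy_extractor`, the threshold value, and the toy extractor that stands in for the trained network are all hypothetical. The block-wise dependency between adjacent blocks is not modeled here.

```python
import numpy as np


def rsan_style_separate(extract_fn, mixture, threshold=0.1, max_iters=4):
    """Extract one mask per iteration until the residual mask is nearly empty.

    extract_fn is a hypothetical stand-in for the trained network: it takes
    the mixture and the current residual mask and returns one source mask.
    """
    residual = np.ones_like(mixture)  # nothing has been explained yet
    masks = []
    for _ in range(max_iters):
        # Assumed stop rule: little time-frequency area is left unexplained.
        if residual.mean() < threshold:
            break
        # A source may only claim area that is still unexplained.
        mask = np.clip(extract_fn(mixture, residual), 0.0, residual)
        masks.append(mask)
        residual = residual - mask
    return masks, residual


def make_toy_extractor(oracle_masks):
    """Toy extractor: returns the oracle mask that best covers the residual."""
    def extract(mixture, residual):
        scores = [float((m * residual).sum()) for m in oracle_masks]
        return oracle_masks[int(np.argmax(scores))]
    return extract


# Toy 4x6 "spectrogram" blocks: one with two speakers, one with a single speaker.
mixture = np.ones((4, 6))
left = np.zeros((4, 6))
left[:, :3] = 1.0
right = 1.0 - left

two_masks, two_residual = rsan_style_separate(
    make_toy_extractor([left, right]), mixture)
one_masks, one_residual = rsan_style_separate(
    make_toy_extractor([np.ones((4, 6))]), mixture)

print(len(two_masks), len(one_masks))
```

In this toy setup the number of returned masks depends on the content of the block rather than on a fixed output size, which is the property the abstract contrasts with PIT-based CSS.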
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing
