Dual-Path Modeling for Long Recording Speech Separation in Meetings
Chenda Li, Zhuo Chen, Yi Luo, Cong Han, Tianyan Zhou, Keisuke Kinoshita, Marc Delcroix, Shinji Watanabe, Yanmin Qian

TL;DR
This paper introduces a transformer-based dual-path speech separation model for long recordings, improving dependency modeling and reducing computation, leading to better speech separation and recognition in meetings.
Contribution
It extends dual-path modeling with transformer layers for continuous speech separation, achieving improved accuracy and efficiency in long, overlapped recordings.
Findings
Consistent word error rate (WER) reduction on the LibriCSS dataset.
Dual-path transformer with convolutional layers reduces computation by 30%.
Online models achieve 10% relative WER improvement.
Abstract
Continuous speech separation (CSS) is the task of separating speech sources from a long, partially overlapped recording that involves a varying number of speakers. A straightforward extension of conventional utterance-level speech separation to the CSS task is to segment the long recording with a fixed-size window and process each window separately. Though effective, this extension fails to model long-range dependencies in speech and thus leads to suboptimal performance. The recently proposed dual-path modeling could remedy this problem, thanks to its ability to jointly model cross-window dependencies and local-window processing. In this work, we further extend the dual-path modeling framework to the CSS task. A transformer-based dual-path system is proposed, which integrates transformer layers for global modeling. The proposed models are applied to LibriCSS, a real…
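To illustrate the two ideas the abstract contrasts, the following is a minimal sketch of fixed-size windowing and of the data layout that dual-path modeling works on. It is not the authors' implementation: the function names `segment` and `dual_path_views`, the parameters `window` and `hop`, and the use of raw samples instead of learned features are all assumptions made for illustration. The learned layers (for example, transformer layers) that would run along each path are omitted.

```python
import numpy as np


def segment(signal, window, hop):
    """Split a 1-D signal into fixed-size windows, zero-padding the tail.

    Hypothetical helper: `window` and `hop` are illustrative parameters,
    not values taken from the paper.
    """
    n = len(signal)
    # Number of windows needed to cover every sample.
    num = 1 if n <= window else 1 + int(np.ceil((n - window) / hop))
    padded = np.zeros(window + (num - 1) * hop, dtype=signal.dtype)
    padded[:n] = signal
    return np.stack([padded[i * hop:i * hop + window] for i in range(num)])


def dual_path_views(chunks):
    """Return the two views that a dual-path model alternates between.

    chunks has shape (num_windows, window).
    - intra view: each row is a sequence inside one window (local path).
    - inter view: each row is a sequence across windows (global path).
    """
    intra = chunks
    inter = chunks.T
    return intra, inter


# Toy input: 11 samples, windows of 4 samples with a hop of 2.
signal = np.arange(11, dtype=float)
chunks = segment(signal, window=4, hop=2)
intra, inter = dual_path_views(chunks)
```

Processing each row of `intra` independently corresponds to the window-by-window baseline described in the abstract. Processing the rows of `inter` as well is what lets information flow across windows.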
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing
