Dual-Path Modeling for Long Recording Speech Separation in Meetings

Chenda Li; Zhuo Chen; Yi Luo; Cong Han; Tianyan Zhou; Keisuke Kinoshita; Marc Delcroix; Shinji Watanabe; Yanmin Qian

arXiv:2102.11634·eess.AS·February 24, 2021

Open Access

TL;DR

This paper introduces a transformer-based dual-path speech separation model for long recordings, improving dependency modeling and reducing computation, leading to better speech separation and recognition in meetings.

Contribution

It extends dual-path modeling with transformer layers for continuous speech separation, achieving improved accuracy and efficiency in long, overlapped recordings.

Findings

01 Consistent WER reduction on the LibriCSS dataset.

02 A dual-path transformer with convolutional layers reduces computation by 30%.

03 Online models achieve a 10% relative WER improvement.

Abstract

Continuous speech separation (CSS) is the task of separating speech sources from a long, partially overlapped recording that involves a varying number of speakers. A straightforward extension of conventional utterance-level speech separation to the CSS task is to segment the long recording with a fixed-size window and process each window separately. Though effective, this extension fails to model long-range dependencies in speech and thus leads to suboptimal performance. The recently proposed dual-path modeling could be a remedy to this problem, thanks to its capability to jointly model cross-window dependencies and local-window processing. In this work, we further extend the dual-path modeling framework to the CSS task. A transformer-based dual-path system is proposed, which integrates transformer layers for global modeling. The proposed models are applied to LibriCSS, a real…
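The contrast the abstract draws between window-by-window processing and dual-path modeling can be illustrated with a minimal sketch. This is not the paper's model: the function names `segment` and `dual_path_pass`, the window and hop sizes, and the placeholder `local_fn` / `global_fn` callables are all assumptions made for illustration, standing in for the learned local and global (transformer) layers. The sketch only shows which axis each path operates along.

```python
import numpy as np

def segment(signal, window, hop):
    """Split a 1-D signal into overlapping fixed-size windows, zero-padding the tail."""
    if len(signal) <= window:
        n_windows = 1
    else:
        n_windows = 1 + int(np.ceil((len(signal) - window) / hop))
    padded = np.zeros(window + (n_windows - 1) * hop, dtype=signal.dtype)
    padded[:len(signal)] = signal
    return np.stack([padded[i * hop : i * hop + window] for i in range(n_windows)])

def dual_path_pass(chunks, local_fn, global_fn):
    """One illustrative dual-path pass over chunks of shape (n_windows, window).

    local_fn  : hypothetical stand-in for local (within-window) processing
    global_fn : hypothetical stand-in for global (cross-window) processing
    """
    # Local path: each window is processed on its own.
    local = np.stack([local_fn(c) for c in chunks])
    # Global path: for each position inside the window, process across all windows.
    return np.stack(
        [global_fn(local[:, t]) for t in range(local.shape[1])], axis=1
    )

# Toy usage with placeholder functions (assumptions, not learned layers).
signal = np.arange(10, dtype=float)
chunks = segment(signal, window=4, hop=2)
out = dual_path_pass(
    chunks,
    local_fn=lambda x: x,              # identity within each window
    global_fn=lambda x: x - x.mean(),  # remove the cross-window mean
)
```

Processing each window separately corresponds to using only the local path; the global path is what lets information flow between windows of the long recording.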

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing