Leveraging Real Conversational Data for Multi-Channel Continuous Speech Separation

Xiaofei Wang; Dongmei Wang; Naoyuki Kanda; Sefik Emre Eskimez; Takuya Yoshioka

arXiv:2204.03232·eess.AS·April 8, 2022

TL;DR

This paper introduces a three-stage training scheme for multi-channel continuous speech separation that effectively utilizes both supervised and large-scale unsupervised real conversational data, improving meeting transcription accuracy.

Contribution

A novel semi-supervised training approach combining simulated, transcribed, and real data with teacher-student learning for CSS models.

Findings

01 Steady performance improvements at each training stage.
02 Effective leveraging of real conversational data.
03 Enhanced multi-channel CSS for meeting transcription.

Abstract

Existing multi-channel continuous speech separation (CSS) models depend heavily on supervised data: either simulated data, which causes a mismatch between training and real-data testing, or real transcribed overlapping data, which is difficult to acquire. This dependence hinders further improvements in conversational/meeting transcription tasks. In this paper, we propose a three-stage training scheme for the CSS model that can leverage both supervised data and extra large-scale unsupervised real-world conversational data. The scheme consists of two conventional training approaches (pre-training using simulated data and ASR-loss-based training using transcribed data) and a novel continuous semi-supervised training stage between the two, in which the CSS model is further trained on real data using the teacher-student learning framework. We apply this scheme to an…
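To make the middle stage more concrete, here is a minimal sketch of teacher-student training on unlabeled data. It is not the authors' implementation: the abstract excerpt does not say how the teacher and student are configured, so the toy linear model, the mean-squared-error loss, the learning rate, and the names `predict`, `distill_step`, `frames`, `teacher`, and `student` are all assumptions made for illustration. The only parts taken from the abstract are the order of the three stages and the idea that a teacher supplies targets for real data that has no transcripts.

```python
import random

# The three stages named in the abstract, in order.
STAGES = (
    "pre-training on simulated data",
    "semi-supervised teacher-student training on real data",
    "ASR-loss-based training on transcribed data",
)

def predict(weights, frame):
    """Toy linear model standing in for a CSS network (illustrative only)."""
    return sum(w * x for w, x in zip(weights, frame))

def distill_step(student, teacher, frames, lr=0.1):
    """One full-batch update: the student regresses onto the frozen teacher's
    outputs with a mean-squared-error loss. No transcripts are used.
    Returns the updated student and the loss measured before the update."""
    n = len(frames)
    grad = [0.0] * len(student)
    loss = 0.0
    for frame in frames:
        target = predict(teacher, frame)      # pseudo-target from the teacher
        err = predict(student, frame) - target
        loss += err * err / n
        for i, x in enumerate(frame):
            grad[i] += 2.0 * err * x / n
    updated = [w - lr * g for w, g in zip(student, grad)]
    return updated, loss

rng = random.Random(0)
# Hypothetical unlabeled "real" frames: 64 frames of 3 features each.
frames = [[rng.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(64)]
teacher = [0.5, -0.3, 0.8]   # frozen teacher; made-up weights
student = [0.0, 0.0, 0.0]    # made-up student initialization

losses = []
for _ in range(200):
    student, loss = distill_step(student, teacher, frames)
    losses.append(loss)
```

In this toy setting the student's loss against the teacher shrinks over the 200 updates. A real CSS system would replace the linear model with a separation network and the random frames with multi-channel recordings.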

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing