Progressive Joint Modeling in Unsupervised Single-channel Overlapped Speech Recognition
Zhehuai Chen, Jasha Droppo, Jinyu Li, Wayne Xiong

TL;DR
This paper introduces a progressive joint modeling framework that combines a modular neural network structure, progressive pretraining, transfer learning, and discriminative training to improve unsupervised single-channel overlapped speech recognition, reporting over 30% relative WER reduction.
Contribution
The paper proposes a modular network that is trained progressively, with transfer learning and discriminative objectives, for overlapped speech recognition, advancing beyond existing permutation invariant training (PIT) methods.
Findings
Over 30% relative WER reduction on overlapped speech datasets.
Enhanced model generalization and training efficiency.
Effective integration of sequence-level linguistic knowledge.
Abstract
Unsupervised single-channel overlapped speech recognition is one of the hardest problems in automatic speech recognition (ASR). Permutation invariant training (PIT) is a state-of-the-art model-based approach, which applies a single neural network to solve this single-input, multiple-output modeling problem. We propose to advance the current state of the art by imposing a modular structure on the neural network, applying a progressive pretraining regimen, and improving the objective function with transfer learning and a discriminative training criterion. The modular structure splits the problem into three sub-tasks: frame-wise interpreting, utterance-level speaker tracing, and speech recognition. The pretraining regimen uses these modules to solve progressively harder tasks. Transfer learning leverages parallel clean speech to improve the training targets for the network. Our…
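The single-input, multiple-output formulation behind PIT can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes each output stream produces frame-wise log posteriors over a set of classes, that the criterion is cross-entropy, and that one label permutation is chosen for the whole utterance. The function name `pit_cross_entropy` and the toy data are hypothetical.

```python
from itertools import permutations

import numpy as np


def pit_cross_entropy(log_probs, targets):
    """Utterance-level permutation invariant cross-entropy (illustrative sketch).

    log_probs: float array of shape (S, T, C), log posteriors for S output streams.
    targets:   int array of shape (S, T), class labels for S reference speakers.
    Returns (loss, perm), where perm[s] is the reference index assigned to
    output stream s under the lowest-loss assignment.
    """
    num_streams, num_frames, _ = log_probs.shape
    frames = np.arange(num_frames)
    best_loss, best_perm = np.inf, None
    # Try every assignment of reference speakers to output streams.
    for perm in permutations(range(num_streams)):
        loss = 0.0
        for stream, ref in enumerate(perm):
            # Negative log-likelihood of the reference labels under this stream.
            loss -= log_probs[stream, frames, targets[ref]].sum()
        loss /= num_streams * num_frames
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


def toy_example():
    """Build toy data in which stream 0 matches speaker 1 and stream 1 matches speaker 0."""
    targets = np.array([[0, 0, 1, 1], [2, 2, 2, 0]])
    probs = np.full((2, 4, 3), 0.05)
    frames = np.arange(4)
    probs[0, frames, targets[1]] = 0.9
    probs[1, frames, targets[0]] = 0.9
    return np.log(probs), targets


if __name__ == "__main__":
    loss, perm = pit_cross_entropy(*toy_example())
    print(loss, perm)
```

The exhaustive search over permutations grows factorially with the number of speakers, which is acceptable for the two-speaker case this kind of sketch targets.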
