Sequence-Level Unsupervised Training in Speech Recognition: A Theoretical Study
Zijian Yang, Jörg Barkoczi, Ralf Schlüter, Hermann Ney

TL;DR
This paper develops a theoretical framework for unsupervised speech recognition, establishing conditions for success, deriving error bounds, and proposing a sequence-level training method grounded in classification error analysis.
Contribution
It introduces a theoretical analysis of unsupervised speech recognition, identifying key conditions for success and proposing a novel sequence-level training loss.
Findings
Derived classification error bounds for unsupervised speech recognition
Validated the theoretical bounds through simulations
Proposed a new sequence-level cross-entropy loss for training
Abstract
Unsupervised speech recognition is the task of training a speech recognition model with unpaired data. To determine when and how unsupervised speech recognition can succeed, and how classification error relates to candidate training objectives, we develop a theoretical framework for unsupervised speech recognition grounded in classification error bounds. We introduce two conditions under which unsupervised speech recognition is possible. The necessity of these conditions is also discussed. Under these conditions, we derive a classification error bound for unsupervised speech recognition and validate this bound in simulations. Motivated by this bound, we propose a single-stage sequence-level cross-entropy loss for unsupervised speech recognition.
Taxonomy
Topics: Speech Recognition and Synthesis · Face and Expression Recognition · Speech and Audio Processing
