Sequence-Level Unsupervised Training in Speech Recognition: A Theoretical Study

Zijian Yang; Jörg Barkoczi; Ralf Schlüter; Hermann Ney

arXiv:2603.02285·cs.SD·March 4, 2026

Open Access

TL;DR

This paper develops a theoretical framework for unsupervised speech recognition, establishing conditions for success, deriving error bounds, and proposing a sequence-level training method grounded in classification error analysis.

Contribution

It introduces a theoretical analysis of unsupervised speech recognition, identifying key conditions for success and proposing a novel sequence-level training loss.

Findings

01 Derived classification error bounds for unsupervised speech recognition

02 Validated the theoretical bounds through simulations

03 Proposed a new sequence-level cross-entropy loss for training

Abstract

Unsupervised speech recognition is the task of training a speech recognition model with unpaired data. To determine when and how unsupervised speech recognition can succeed, and how classification error relates to candidate training objectives, we develop a theoretical framework for unsupervised speech recognition grounded in classification error bounds. We introduce two conditions under which unsupervised speech recognition is possible. The necessity of these conditions is also discussed. Under these conditions, we derive a classification error bound for unsupervised speech recognition and validate this bound in simulations. Motivated by this bound, we propose a single-stage sequence-level cross-entropy loss for unsupervised speech recognition.

Taxonomy

Topics: Speech Recognition and Synthesis · Face and Expression Recognition · Speech and Audio Processing