Right Label Context in End-to-End Training of Time-Synchronous ASR Models
Tina Raissi, Ralf Schlüter, Hermann Ney

TL;DR
This paper introduces a factored loss function for end-to-end time-synchronous ASR models that incorporates right label context. The right context helps most when training data is limited, and the loss allows hybrid HMM systems to be trained with the full-sum criterion.
Contribution
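It proposes a factored loss with auxiliary left and right label contexts that sums over all alignments, addressing the normalization problems that right label context causes in discriminative training criteria and allowing hybrid HMM systems to be trained with the same loss.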
Findings
Including right label context benefits low-resource training.
The factored loss improves ASR accuracy on Switchboard and LibriSpeech.
Hybrid HMM systems can be trained with the full-sum criterion.
Abstract
Current time-synchronous sequence-to-sequence automatic speech recognition (ASR) models are trained using a sequence-level cross-entropy that sums over all alignments. Due to the discriminative formulation, incorporating the right label context into the training criterion's gradient causes normalization problems and is not mathematically well-defined. The classic hybrid neural network hidden Markov model (NN-HMM), with its inherent generative formulation, enables conditioning on the right label context. However, due to HMM state-tying, the identity of the right label context is never modeled explicitly. In this work, we propose a factored loss with auxiliary left and right label contexts that sums over all alignments. We show that the inclusion of the right label context is particularly beneficial when training data resources are limited. Moreover, we also show that it is possible to…
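The phrase "sums over all alignments" can be illustrated with a minimal sketch of a full-sum criterion computed by the forward algorithm. This is not the paper's factored loss: it assumes a simple topology in which each label occupies one or more consecutive frames (no blank symbol, no transition probabilities, no left or right label context), and the names `logsumexp` and `full_sum_neg_log_likelihood` are hypothetical.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))


def full_sum_neg_log_likelihood(log_probs, labels):
    """Negative log of the summed score of all alignments of `labels`.

    log_probs[t][k] is the log-score of label k at frame t (assumed input).
    An alignment assigns every frame to a label, in order, with each label
    covering at least one frame (assumed topology, no blank).
    """
    num_frames, num_labels = len(log_probs), len(labels)
    if num_labels == 0 or num_frames < num_labels:
        return math.inf  # no valid alignment exists

    # alpha[s]: log-sum over partial alignments ending in labels[s] at frame t
    alpha = [-math.inf] * num_labels
    alpha[0] = log_probs[0][labels[0]]

    for t in range(1, num_frames):
        new_alpha = [-math.inf] * num_labels
        for s in range(num_labels):
            stay = alpha[s]                                  # repeat the same label
            advance = alpha[s - 1] if s > 0 else -math.inf   # move to the next label
            new_alpha[s] = logsumexp([stay, advance]) + log_probs[t][labels[s]]
        alpha = new_alpha

    # Valid alignments must end in the last label at the last frame.
    return -alpha[num_labels - 1]


if __name__ == "__main__":
    # Three frames, two labels, uniform frame scores of 0.5.
    frame_scores = [[math.log(0.5), math.log(0.5)] for _ in range(3)]
    print(full_sum_neg_log_likelihood(frame_scores, [0, 1]))
```

In the toy input, the label sequence `[0, 1]` has two alignments over three frames (`0 0 1` and `0 1 1`), each with probability 0.125, so the summed probability is 0.25 and the loss is log 4.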
Taxonomy
Topics: Fault Detection and Control Systems
