Revisiting the Entropy Semiring for Neural Speech Recognition
Oscar Chang, Dongseong Hwang, Olivier Siohan

TL;DR
This paper uses the entropy semiring to quantify alignment uncertainty in neural speech recognition models. The resulting entropy term supports alignment regularization and distillation, and leads to state-of-the-art results in streaming speech recognition.
Contribution
The paper revisits the entropy semiring for neural speech recognition and introduces an open-source implementation with numerically stable, parallel variants, demonstrating improvements in accuracy and latency.
Findings
Alignment distillation improves model accuracy.
State-of-the-art performance on LibriSpeech streaming tasks.
Open-source semiring implementation for CTC and RNN-T.
Abstract
In streaming settings, speech recognition models have to map sub-sequences of speech to text before the full audio stream becomes available. However, since alignment information between speech and text is rarely available during training, models need to learn it in a completely self-supervised way. In practice, the exponential number of possible alignments makes this extremely challenging, with models often learning peaky or sub-optimal alignments. Prima facie, the exponential nature of the alignment space makes it difficult to even quantify the uncertainty of a model's alignment distribution. Fortunately, it has been known for decades that the entropy of a probabilistic finite state transducer can be computed in time linear in the size of the transducer via a dynamic programming reduction based on semirings. In this work, we revisit the entropy semiring for neural speech recognition…
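The semiring reduction mentioned in the abstract can be illustrated on a toy lattice. The sketch below is not the paper's implementation: it assumes the standard entropy (expectation) semiring over pairs (p, r), works in plain probability space rather than the log-space form a numerically stable version would use, and every name in it (`lift`, `lattice_entropy`, `brute_force_entropy`, `ARCS`) is hypothetical. It runs one forward pass over the arcs and checks the result against explicit path enumeration.

```python
import math

# Entropy semiring over pairs (p, r):
#   (p1, r1) (+) (p2, r2) = (p1 + p2, r1 + r2)
#   (p1, r1) (x) (p2, r2) = (p1 * p2, p1 * r2 + p2 * r1)
def splus(a, b):
    return (a[0] + b[0], a[1] + b[1])

def stimes(a, b):
    return (a[0] * b[0], a[0] * b[1] + b[0] * a[1])

def lift(w):
    # Map an arc weight w > 0 to the semiring element (w, -w log w).
    return (w, -w * math.log(w))

def lattice_entropy(arcs, num_states):
    """Entropy of the normalized path distribution of an acyclic lattice.

    Assumes states are numbered in topological order (src < dst),
    state 0 is initial and state num_states - 1 is final.
    Cost is linear in the number of arcs.
    """
    alpha = [(0.0, 0.0)] * num_states   # semiring zero everywhere
    alpha[0] = (1.0, 0.0)               # semiring one at the start state
    for src, dst, w in sorted(arcs):    # sorting by src gives a valid order
        alpha[dst] = splus(alpha[dst], stimes(alpha[src], lift(w)))
    z, r = alpha[-1]                    # z = sum of path weights, r = -sum p log p
    return math.log(z) + r / z

def brute_force_entropy(arcs, num_states):
    """Reference value: enumerate every path explicitly (exponential cost)."""
    outgoing = {}
    for src, dst, w in arcs:
        outgoing.setdefault(src, []).append((dst, w))
    path_weights = []

    def walk(state, weight):
        if state == num_states - 1:
            path_weights.append(weight)
            return
        for dst, w in outgoing.get(state, []):
            walk(dst, weight * w)

    walk(0, 1.0)
    z = sum(path_weights)
    return -sum((p / z) * math.log(p / z) for p in path_weights)

# Toy lattice with parallel arcs and a skip arc; weights are illustrative only.
ARCS = [(0, 1, 0.6), (0, 1, 0.4), (1, 2, 0.5), (1, 2, 0.5), (0, 2, 0.3)]
print(lattice_entropy(ARCS, 3), brute_force_entropy(ARCS, 3))
```

The two printed values should agree up to floating-point error. The forward pass touches each arc once, whereas the enumeration grows with the number of paths, which is the gap the abstract refers to.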
Taxonomy
Topics: Speech Recognition and Synthesis · Neural Networks and Applications · Topic Modeling
