Equivalence of Segmental and Neural Transducer Modeling: A Proof of Concept
Wei Zhou, Albert Zeyer, André Merboldt, Ralf Schlüter, Hermann Ney

TL;DR
This paper proves the theoretical equivalence between RNN-Transducer and segmental models in speech recognition, showing they have the same modeling power and exploring decoding strategies through initial experiments.
Contribution
It establishes the formal equivalence between RNN-Transducer and segmental (direct HMM) models, relating blank emission in the former to segment length modeling in the latter and showing that both have equal modeling power.
Findings
Blank probabilities correspond to segment length probabilities.
Time-synchronous and label-synchronous decoding strategies have distinct properties.
Transducer and segmental models are theoretically equivalent in modeling power.
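The first finding can be illustrated with a small sketch. It is not taken from the paper: it assumes a strictly monotonic topology in which every frame emits exactly one symbol (blank or a label), and it uses blank probabilities that depend only on the frame index, whereas in a real transducer they also depend on the label context. The function name `segment_length_probs` is hypothetical. Under these assumptions, the probability that the current segment ends at a given frame is the product of the blank probabilities of all earlier frames in the segment, times the non-blank probability at that frame.

```python
from typing import List


def segment_length_probs(blank_probs: List[float]) -> List[float]:
    """Turn per-frame blank probabilities into segment length probabilities.

    blank_probs[i] is the (assumed, context-independent) probability of
    emitting blank at the i-th frame after the previous segment boundary.
    The returned list holds, at index i, the probability that the segment
    has length i + 1, i.e. blank on the first i frames and a label on
    frame i + 1.
    """
    probs = []
    all_blank_so_far = 1.0  # probability that every earlier frame was blank
    for b in blank_probs:
        probs.append(all_blank_so_far * (1.0 - b))
        all_blank_so_far *= b
    return probs


# Three frames remain; the last one is forced to emit a label (blank prob 0),
# so the length probabilities form a proper distribution.
lengths = segment_length_probs([0.5, 0.5, 0.0])
print(lengths)       # [0.5, 0.25, 0.25]
print(sum(lengths))  # 1.0
```

Setting the blank probability of the final frame to zero mirrors the requirement that the last segment must end at the last frame; without it, the remaining probability mass would correspond to segments that extend beyond the input.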
Abstract
With the advent of direct models in automatic speech recognition (ASR), the formerly prevalent frame-wise acoustic modeling based on hidden Markov models (HMM) diversified into a number of modeling architectures like encoder-decoder attention models, transducer models and segmental models (direct HMM). While transducer models stay with a frame-level model definition, segmental models are defined on the level of label segments directly. While (soft-)attention-based models avoid explicit alignment, transducer and segmental approaches do model alignment internally, either by segment hypotheses or, more implicitly, by emitting so-called blank symbols. In this work, we prove that the widely used class of RNN-Transducer models and segmental models (direct HMM) are equivalent and therefore show equal modeling power. It is shown that blank probabilities translate into segment length probabilities…
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Natural Language Processing Techniques
