Phoneme Based Neural Transducer for Large Vocabulary Speech Recognition
Wei Zhou, Simon Berger, Ralf Schlüter, Hermann Ney

TL;DR
This paper introduces a phoneme-based neural transducer model for large vocabulary speech recognition that combines classical and end-to-end methods, achieving competitive results with a simple, efficient approach.
Contribution
It proposes a phoneme-based neural transducer with word-end-based phoneme label augmentation, a simplified neural network structure, and a simple, stable training procedure based on frame-wise cross-entropy loss.
Findings
Achieves performance comparable to the state of the art on the TED-LIUM and Switchboard datasets.
Demonstrates the effectiveness of phoneme label augmentation and simplified neural architecture.
Shows that a phonetic context size of one is sufficient for the best performance.
Abstract
To join the advantages of classical and end-to-end approaches for speech recognition, we present a simple, novel and competitive approach for phoneme-based neural transducer modeling. Different alignment label topologies are compared and word-end-based phoneme label augmentation is proposed to improve performance. Utilizing the local dependency of phonemes, we adopt a simplified neural network structure and a straightforward integration with the external word-level language model to preserve the consistency of seq-to-seq modeling. We also present a simple, stable and efficient training procedure using frame-wise cross-entropy loss. A phonetic context size of one is shown to be sufficient for the best performance. A simplified scheduled sampling approach is applied for further improvement and different decoding approaches are briefly compared. The overall performance of our best model is…
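As a rough illustration of the word-end-based phoneme label augmentation mentioned in the abstract, the sketch below tags the final phoneme of each word with a marker so that word boundaries stay visible in the phoneme label sequence. This is a minimal sketch under stated assumptions, not the authors' implementation: the `#` marker, the function name `augment_word_end`, and the toy lexicon are all hypothetical, and every word is assumed to have exactly one non-empty pronunciation.

```python
from typing import Dict, List

# Hypothetical word-end marker; the symbol used in the paper is not stated here.
WORD_END = "#"


def augment_word_end(words: List[str], lexicon: Dict[str, List[str]]) -> List[str]:
    """Convert words to phoneme labels, tagging the last phoneme of each word."""
    labels: List[str] = []
    for word in words:
        phonemes = lexicon[word]  # assumes one non-empty pronunciation per word
        labels.extend(phonemes[:-1])
        labels.append(phonemes[-1] + WORD_END)
    return labels


# Toy lexicon for illustration only.
lexicon = {"speech": ["S", "P", "IY", "CH"], "is": ["IH", "Z"]}
labels = augment_word_end(["speech", "is"], lexicon)
print(labels)  # ['S', 'P', 'IY', 'CH#', 'IH', 'Z#']
```

With such labels, a word-final phoneme and its word-internal counterpart become distinct output classes, so the label inventory grows by at most a factor of two.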
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
