Phoneme Based Neural Transducer for Large Vocabulary Speech Recognition

Wei Zhou; Simon Berger; Ralf Schlüter; Hermann Ney

arXiv:2010.16368·cs.CL·April 21, 2021·1 cites

Open Access

TL;DR

This paper introduces a phoneme-based neural transducer for large-vocabulary speech recognition that combines the advantages of classical and end-to-end approaches, reaching competitive results with a simple model and an efficient training procedure.

Contribution

It proposes a phoneme-based neural transducer with word-end-based phoneme label augmentation, a simplified neural network structure, and a simple, stable training procedure based on frame-wise cross-entropy loss.

Findings

01 Achieves performance comparable to the state of the art on the TED-LIUM and Switchboard datasets.

02 Demonstrates the effectiveness of phoneme label augmentation and a simplified neural architecture.

03 Shows that a phonetic context size of one is sufficient for the best performance.
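
To make the second and third findings concrete, here is a minimal Python sketch of one plausible reading of word-end-based label augmentation and of a label context of size one. It is not the authors' implementation: the toy lexicon, the `#` word-end marker, the `<s>` padding symbol, and both function names are assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def augment_word_end(words: List[str],
                     lexicon: Dict[str, List[str]],
                     marker: str = "#") -> List[str]:
    """Map words to phonemes and tag the last phoneme of each word.

    The marker suffix is a hypothetical convention; it only serves to make
    word-final phonemes distinct labels from their word-internal versions.
    """
    labels: List[str] = []
    for word in words:
        phonemes = lexicon[word]
        if not phonemes:
            continue  # skip entries with an empty pronunciation
        labels.extend(phonemes[:-1])
        labels.append(phonemes[-1] + marker)
    return labels


def label_contexts(labels: List[str],
                   context_size: int = 1,
                   pad: str = "<s>") -> List[Tuple[str, ...]]:
    """Return, for each label position, the preceding `context_size` labels."""
    padded = [pad] * context_size + list(labels)
    return [tuple(padded[i:i + context_size]) for i in range(len(labels))]


# Toy lexicon (assumed, not taken from the paper).
lexicon = {"the": ["DH", "AH"], "cat": ["K", "AE", "T"]}
labels = augment_word_end(["the", "cat"], lexicon)
print(labels)                  # ['DH', 'AH#', 'K', 'AE', 'T#']
print(label_contexts(labels))  # each label sees only its single predecessor
```

With `context_size=1`, each prediction is conditioned on only the previous phoneme label, which is the setting the third finding reports as sufficient.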

Abstract

To join the advantages of classical and end-to-end approaches for speech recognition, we present a simple, novel and competitive approach for phoneme-based neural transducer modeling. Different alignment label topologies are compared and word-end-based phoneme label augmentation is proposed to improve performance. Utilizing the local dependency of phonemes, we adopt a simplified neural network structure and a straightforward integration with the external word-level language model to preserve the consistency of seq-to-seq modeling. We also present a simple, stable and efficient training procedure using frame-wise cross-entropy loss. A phonetic context size of one is shown to be sufficient for the best performance. A simplified scheduled sampling approach is applied for further improvement and different decoding approaches are briefly compared. The overall performance of our best model is…
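
The frame-wise cross-entropy training mentioned in the abstract can be illustrated with a small, dependency-free Python sketch. It assumes that a fixed alignment, giving one target label index per frame, is already available; how the paper obtains that alignment and how its label set is indexed are not stated in this summary, so `frame_logits`, `alignment`, and both function names are hypothetical.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one frame's label scores."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def framewise_cross_entropy(frame_logits: Sequence[Sequence[float]],
                            alignment: Sequence[int]) -> float:
    """Mean negative log-probability of the aligned label at each frame.

    frame_logits[t] holds unnormalized scores over all labels at frame t;
    alignment[t] is the index of the target label for frame t (assumed given).
    """
    if len(frame_logits) != len(alignment) or not alignment:
        raise ValueError("need one target label per frame and at least one frame")
    total = 0.0
    for logits, target in zip(frame_logits, alignment):
        total -= log_softmax(logits)[target]
    return total / len(alignment)


# Two frames, two labels, uniform scores: the loss equals log(2).
loss = framewise_cross_entropy([[0.0, 0.0], [0.0, 0.0]], [0, 1])
print(loss)
```

Because each frame contributes an independent term, this loss avoids summing over all alignments, which is consistent with the abstract's description of the procedure as simple and efficient.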

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing