A systematic comparison of grapheme-based vs. phoneme-based label units for encoder-decoder-attention models
Mohammad Zeineldeen, Albert Zeyer, Wei Zhou, Thomas Ng, Ralf Schlüter, Hermann Ney

TL;DR
This paper systematically compares grapheme-based and phoneme-based label units in encoder-decoder-attention models for speech recognition, analyzing their performance and differences on standard benchmarks.
Contribution
It provides a comprehensive comparison between grapheme and phoneme output units, including the use of phoneme groups and auxiliary units for homophone distinction.
Findings
Phoneme-based models are competitive with grapheme-based models.
Using phoneme groups can improve recognition performance.
Auxiliary units help distinguish homophones effectively.
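To make the homophone finding concrete, here is a minimal sketch of how auxiliary units could make pronunciations unique. It is an illustration only, not the authors' procedure: the toy lexicon, the `#1`, `#2`, ... symbol names, and the function `add_auxiliary_units` are all assumptions introduced here.

```python
from collections import defaultdict


def add_auxiliary_units(lexicon):
    """Append a distinct auxiliary symbol to every pronunciation shared by several words.

    `lexicon` maps a word to its phoneme list. The "#<n>" symbols are
    hypothetical placeholders, not the unit names used in the paper.
    """
    # Group words by their (identical) phoneme sequence.
    words_by_pron = defaultdict(list)
    for word, phones in lexicon.items():
        words_by_pron[tuple(phones)].append(word)

    disambiguated = {}
    for phones, words in words_by_pron.items():
        if len(words) == 1:
            # Unique pronunciation: no auxiliary unit needed.
            disambiguated[words[0]] = list(phones)
            continue
        # Homophones: give each word its own trailing auxiliary unit.
        for index, word in enumerate(sorted(words), start=1):
            disambiguated[word] = list(phones) + [f"#{index}"]
    return disambiguated


# Toy lexicon (assumed for illustration): three homophones and one unique word.
toy_lexicon = {
    "to": ["T", "UW"],
    "too": ["T", "UW"],
    "two": ["T", "UW"],
    "cat": ["K", "AE", "T"],
}

aux_lexicon = add_auxiliary_units(toy_lexicon)

# With auxiliary units, every label sequence maps back to exactly one word.
pron_to_word = {tuple(phones): word for word, phones in aux_lexicon.items()}
```

Because each label sequence now identifies a single word, a decoder that outputs phoneme labels can recover the spelling by lookup, which is one way to read the abstract's remark about keeping the decoder design simple.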
Abstract
Following the rationale of end-to-end modeling, CTC, RNN-T or encoder-decoder-attention models for automatic speech recognition (ASR) use graphemes or grapheme-based subword units based on e.g. byte-pair encoding (BPE). The mapping from pronunciation to spelling is learned completely from data. In contrast to this, classical approaches to ASR employ secondary knowledge sources in the form of phoneme lists to define phonetic output labels and pronunciation lexica. In this work, we do a systematic comparison between grapheme- and phoneme-based output labels for an encoder-decoder-attention ASR model. We investigate the use of single phonemes as well as BPE-based phoneme groups as output labels of our model. To preserve a simplified and efficient decoder design, we also extend the phoneme set by auxiliary units to be able to distinguish homophones. Experiments performed on the Switchboard…
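The "BPE-based phoneme groups" in the abstract can be pictured with a small sketch of byte-pair encoding run over phoneme sequences instead of characters. This is a generic BPE learner under stated assumptions, not the authors' implementation: the toy pronunciations, the `_` joiner, the tie-breaking rule, and the function names `merge_pair` and `learn_phoneme_bpe` are all hypothetical.

```python
from collections import Counter


def merge_pair(seq, pair):
    """Replace each non-overlapping occurrence of `pair` in `seq` by one merged unit."""
    merged = []
    i = 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + "_" + seq[i + 1])  # "_" joiner is an assumption
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged


def learn_phoneme_bpe(sequences, num_merges):
    """Greedily merge the most frequent adjacent phoneme pair, `num_merges` times."""
    seqs = [list(seq) for seq in sequences]
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for seq in seqs:
            pair_counts.update(zip(seq, seq[1:]))
        if not pair_counts:
            break  # nothing left to merge
        # Break count ties by the pair itself so the result is deterministic.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        seqs = [merge_pair(seq, best) for seq in seqs]
    return merges, seqs


# Toy word pronunciations (assumed): cat, bat, hat, two.
toy_prons = [
    ["K", "AE", "T"],
    ["B", "AE", "T"],
    ["HH", "AE", "T"],
    ["T", "UW"],
]

merges, grouped = learn_phoneme_bpe(toy_prons, num_merges=1)
```

In this toy corpus the pair `AE`, `T` occurs three times, so the first merge creates the group `AE_T`. Using zero merges corresponds to single-phoneme labels, and more merges yield longer phoneme groups, which is the axis along which the abstract says the output labels are varied.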
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing
Methods: Byte Pair Encoding
