A systematic comparison of grapheme-based vs. phoneme-based label units for encoder-decoder-attention models

Mohammad Zeineldeen; Albert Zeyer; Wei Zhou; Thomas Ng; Ralf Schlüter; Hermann Ney

arXiv:2005.09336·eess.AS·April 16, 2021·5 cites

TL;DR

This paper systematically compares grapheme-based and phoneme-based label units in encoder-decoder-attention models for speech recognition, analyzing their performance and differences on standard benchmarks.

Contribution

It provides a comprehensive comparison between grapheme and phoneme output units, including the use of phoneme groups and auxiliary units for homophone distinction.

Findings

01 Phoneme-based models are competitive with grapheme-based models.

02 Using phoneme groups can improve recognition performance.

03 Auxiliary units help distinguish homophones effectively.

Abstract

Following the rationale of end-to-end modeling, CTC, RNN-T or encoder-decoder-attention models for automatic speech recognition (ASR) use graphemes or grapheme-based subword units based on e.g. byte-pair encoding (BPE). The mapping from pronunciation to spelling is learned completely from data. In contrast to this, classical approaches to ASR employ secondary knowledge sources in the form of phoneme lists to define phonetic output labels and pronunciation lexica. In this work, we do a systematic comparison between grapheme- and phoneme-based output labels for an encoder-decoder-attention ASR model. We investigate the use of single phonemes as well as BPE-based phoneme groups as output labels of our model. To preserve a simplified and efficient decoder design, we also extend the phoneme set by auxiliary units to be able to distinguish homophones. Experiments performed on the Switchboard…
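To make the two label-unit ideas in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. It shows (a) how auxiliary units can keep homophones distinguishable at the label level and (b) how BPE merges can turn single phonemes into phoneme groups. The toy lexicon, the `#1`/`#2` marker names, the `_` joiner, and the choice to append the marker at the end of the pronunciation are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical toy lexicon (word -> phoneme sequence); not taken from the paper.
LEXICON = {
    "to": ["t", "uw"],
    "two": ["t", "uw"],
    "too": ["t", "uw"],
    "tool": ["t", "uw", "l"],
    "cat": ["k", "ae", "t"],
}


def add_homophone_markers(lexicon):
    """Append an auxiliary unit (#1, #2, ...) to every word after the first
    that shares a pronunciation, so each word gets a unique label sequence.
    Marker naming and placement are assumptions of this sketch."""
    seen = Counter()
    marked = {}
    for word, phones in lexicon.items():
        key = tuple(phones)
        n = seen[key]
        seen[key] += 1
        marked[word] = list(phones) + ([f"#{n}"] if n > 0 else [])
    return marked


def merge_pair(seq, pair):
    """Replace each non-overlapping occurrence of `pair` in `seq` by one unit."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + "_" + seq[i + 1])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def learn_bpe_merges(sequences, num_merges):
    """Greedy BPE over phoneme sequences: repeatedly merge the most frequent
    adjacent pair of units into a single phoneme-group unit."""
    seqs = [list(s) for s in sequences]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for s in seqs:
            pairs.update(zip(s, s[1:]))
        if not pairs:
            break
        # Break frequency ties by the pair itself so the result is deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        seqs = [merge_pair(s, best) for s in seqs]
    return merges, seqs


marked = add_homophone_markers(LEXICON)
merges, grouped = learn_bpe_merges(LEXICON.values(), num_merges=1)
for word, labels in marked.items():
    print(word, labels)
print(merges, grouped)
```

With the markers in place, every word in the toy lexicon maps to a distinct label sequence, so a decoder that emits labels directly can tell "to", "two", and "too" apart without a separate disambiguation step. How the paper combines markers with BPE merging is not stated in the text above, so the sketch keeps the two steps separate.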

Code & Models

Repositories

rwth-i6/returnn-experiments (Official)

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing

Methods: Byte Pair Encoding