LSTM Acoustic Models Learn to Align and Pronounce with Graphemes

Arindrima Datta; Guanlong Zhao; Bhuvana Ramabhadran; Eugene Weinstein

arXiv:2008.06121 · eess.AS · August 17, 2020

TL;DR

This paper introduces a data-driven, grapheme-based LSTM acoustic model for speech recognition that aligns and pronounces words without needing handcrafted phoneme lexicons, showing competitive accuracy and useful alignments.

Contribution

The paper presents a training methodology for grapheme-based speech recognition models that do not require handcrafted lexicons, enabling practical and scalable ASR systems.

Findings

1. Grapheme models achieve comparable WER to phoneme models on large datasets.

2. Grapheme models produce high-quality audio-to-grapheme alignments.

3. The approach works effectively across linguistically diverse Indian languages.
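
For readers unfamiliar with the metric behind the first finding, word error rate (WER) is the word-level edit distance between a reference transcript and a recognizer hypothesis, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the authors' evaluation code; the function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / reference word count.

    Assumption: words are separated by whitespace; real evaluations usually
    apply text normalization first, which is omitted here.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.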

Abstract

Automated speech recognition coverage of the world's languages continues to expand. However, standard phoneme-based systems require handcrafted lexicons that are difficult and expensive to obtain. To address this problem, we propose a training methodology for a grapheme-based speech recognizer that can be trained in a purely data-driven fashion. Built with LSTM networks and trained with the cross-entropy loss, the grapheme-output acoustic models we study are also extremely practical for real-world applications as they can be decoded with conventional ASR stack components such as language models and FST decoders, and produce good-quality audio-to-grapheme alignments that are useful in many speech applications. We show that the grapheme models are competitive in WER with their phoneme-output counterparts when trained on large datasets, with the advantage that grapheme models do not…
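
To make "trained with the cross-entropy loss" concrete, the sketch below computes a frame-level cross-entropy: each acoustic frame has a vector of scores over grapheme classes and one target grapheme index taken from an alignment. This is a generic illustration under stated assumptions, not the paper's training recipe. The names `softmax` and `frame_cross_entropy`, the toy three-class inventory, and the example scores are all hypothetical; in a real system the scores would come from an LSTM network.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities (numerically stable form)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def frame_cross_entropy(frame_logits, frame_targets):
    """Mean negative log-probability of the target grapheme at each frame.

    frame_logits: one list of class scores per frame (hypothetical model output).
    frame_targets: one grapheme class index per frame (from an alignment).
    """
    if len(frame_logits) != len(frame_targets):
        raise ValueError("need exactly one target per frame")
    if not frame_targets:
        raise ValueError("need at least one frame")
    total = 0.0
    for logits, target in zip(frame_logits, frame_targets):
        total -= math.log(softmax(logits)[target])
    return total / len(frame_targets)


# Toy example: 3 hypothetical grapheme classes, 2 frames.
logits = [[4.0, 0.0, 0.0], [0.0, 4.0, 0.0]]
good = frame_cross_entropy(logits, [0, 1])  # targets match the high scores
bad = frame_cross_entropy(logits, [2, 2])   # targets contradict the scores
```

The loss is low when the model assigns high probability to the aligned grapheme at each frame (`good`) and high otherwise (`bad`), which is the signal that pushes a model toward both pronunciation and alignment.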

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing

Methods: Tanh Activation · Sigmoid Activation · Long Short-Term Memory