Syllable Subword Tokens for Open Vocabulary Speech Recognition in Malayalam

Kavya Manohar; A. R. Jayan; Rajeev Rajan

arXiv:2301.06736·cs.CL·January 18, 2023

TL;DR

This paper explores the use of syllable-based subword tokens in Malayalam speech recognition to handle the language's morphological complexity, aiming to improve vocabulary coverage and reduce model size.

Contribution

It introduces syllable subword tokens for Malayalam ASR and evaluates their impact on lexicon size, memory, and accuracy, demonstrating advantages over word-based models.

Findings

01 Reduced lexicon size and memory requirements.

02 Improved word error rate with syllable subword tokens.

03 Enhanced handling of out-of-vocabulary words.

Abstract

In a hybrid automatic speech recognition (ASR) system, a pronunciation lexicon (PL) and a language model (LM) are essential to correctly retrieve spoken word sequences. Because Malayalam is morphologically complex, its vocabulary is so large that it is impossible to build a PL and an LM that cover all its diverse word forms. Using subword tokens to build the PL and LM, and combining them to form words after decoding, enables the recovery of many out-of-vocabulary words. In this work, we investigate the impact of using syllables as subword tokens instead of words in Malayalam ASR and evaluate the relative improvement in lexicon size, model memory requirement, and word error rate.
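
The step of "combining them to form words after decoding" can be illustrated with a minimal Python sketch. It assumes a hypothetical convention in which every non-final syllable of a word carries a trailing "+" marker, and it uses romanized, pre-split syllables as placeholders. Neither the marker nor the syllable splits are taken from the paper; its actual syllabification and marking scheme may differ.

```python
# Assumption: a trailing "+" marks a syllable that continues into the next one.
# This marker is illustrative and is not taken from the paper.
MARKER = "+"


def word_to_subwords(syllables):
    """Tag all but the last syllable of one word with the continuation marker."""
    if not syllables:
        return []
    return [s + MARKER for s in syllables[:-1]] + [syllables[-1]]


def subwords_to_words(tokens):
    """Rejoin a decoded subword token sequence into whole words."""
    words, current = [], ""
    for tok in tokens:
        if tok.endswith(MARKER):
            # Non-final syllable: strip the marker and keep building the word.
            current += tok[: -len(MARKER)]
        else:
            # Final syllable: close the current word.
            words.append(current + tok)
            current = ""
    if current:
        # The sequence ended mid-word; keep the partial word rather than drop it.
        words.append(current)
    return words


# Romanized placeholder syllables (hypothetical splits, for illustration only).
sentence = [["ma", "la", "yaa", "lam"], ["bhaa", "sha"]]
tokens = [t for word in sentence for t in word_to_subwords(word)]
print(tokens)
print(subwords_to_words(tokens))
```

Because the lexicon and language model in such a setup are built over syllable tokens, a word that never appeared in the training text can still be produced, as long as each of its syllables is a known token.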

Code & Models

Repositories

https://gitlab.com/kavyamanohar/ml-subword-asr (Official)

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling