What is lost in Normalization? Exploring Pitfalls in Multilingual ASR Model Evaluations

Kavya Manohar; Leena G Pillai; Elizabeth Sherly

arXiv:2409.02449 · cs.CL · November 12, 2024

TL;DR

This paper critically examines the flaws in current text normalization practices used in evaluating multilingual ASR models, especially for Indic scripts, revealing how these practices can distort performance metrics and proposing more linguistically informed normalization methods.

Contribution

It identifies specific pitfalls in existing normalization routines for Indic scripts and advocates for linguistically informed normalization to improve evaluation accuracy.

Findings

01. Normalization routines can artificially inflate performance metrics.

02. Current practices often ignore linguistic nuances of Indic scripts.

03. Proposed normalization methods improve evaluation robustness.
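
The first finding can be illustrated with a minimal sketch. It assumes a hypothetical normalizer, `strip_marks`, that drops every Unicode mark character (general category `M*`), which in Devanagari includes dependent vowel signs. This is not the code of any of the evaluated systems; it only mimics the kind of routine under discussion. The `wer` helper and the example word pair are likewise illustrative choices, not taken from the paper.

```python
import unicodedata


def strip_marks(text: str) -> str:
    """Hypothetical naive normalizer: drop every Unicode mark (category M*).

    Assumption for illustration only; in Indic scripts this also removes
    dependent vowel signs, which carry meaning.
    """
    return "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("M")
    )


def edit_distance(a, b):
    """Levenshtein distance between two sequences (single-row DP)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over reference length."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    return edit_distance(ref_words, hyp_words) / len(ref_words)


# Two distinct Hindi words that differ only by the vowel sign U+093E.
reference = "\u0915\u092e\u0932"         # कमल  ("lotus")
hypothesis = "\u0915\u092e\u093e\u0932"  # कमाल ("marvel")

raw_wer = wer(reference, hypothesis)
normalized_wer = wer(strip_marks(reference), strip_marks(hypothesis))

print(raw_wer, normalized_wer)  # 1.0 0.0
```

Without normalization the hypothesis is scored as a full word error. After the mark-stripping step both strings collapse to the same sequence of consonants, so the error disappears from the metric even though the recognized word is wrong.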

Abstract

This paper explores the pitfalls in evaluating multilingual automatic speech recognition (ASR) models, with a particular focus on Indic language scripts. We investigate the text normalization routines employed by leading ASR models, including OpenAI Whisper, Meta's MMS, Seamless, and Assembly AI's Conformer, and their unintended consequences for performance metrics. Our research reveals that current text normalization practices, which aim to standardize ASR outputs for fair comparison by removing inconsistencies such as variations in spelling, punctuation, and special characters, are fundamentally flawed when applied to Indic scripts. Through empirical analysis using text similarity scores and in-depth linguistic examination, we demonstrate that these flaws lead to artificially improved performance metrics for Indic languages. We conclude by proposing a shift towards developing text…

Taxonomy

Topics: Speech Recognition and Synthesis · Interpreting and Communication in Healthcare · Natural Language Processing Techniques