Evaluation of Automatic Speech Recognition Using Generative Large Language Models
Thibault Bañeras-Roux, Shashi Kumar, Driss Khalil, Sergio Burdisso, Petr Motlicek, Shiran Liu, Mickael Rouvier, Jane Wottawa, Richard Dufour

TL;DR
This paper explores decoder-based large language models as evaluators of automatic speech recognition output, showing that they agree with human judgments more often than WER and embedding-based semantic metrics and that they offer a route to interpretable evaluation.
Contribution
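It introduces three LLM-based approaches to ASR evaluation (choosing the better of two candidate hypotheses, computing semantic distance from generative embeddings, and qualitatively classifying errors) and compares them against existing metrics.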
Findings
On the HATS dataset, the best LLMs achieve 92-94% agreement with human annotators in hypothesis selection, compared to 63% for WER.
LLMs outperform traditional WER and semantic metrics in correlation with human perception.
Decoder-based LLM embeddings perform comparably to encoder models in semantic evaluation.
Abstract
Automatic Speech Recognition (ASR) is traditionally evaluated using Word Error Rate (WER), a metric that is insensitive to meaning. Embedding-based semantic metrics are better correlated with human perception, but decoder-based Large Language Models (LLMs) remain underexplored for this task. This paper evaluates their relevance through three approaches: (1) selecting the best hypothesis between two candidates, (2) computing semantic distance using generative embeddings, and (3) qualitative classification of errors. On the HATS dataset, the best LLMs achieve 92-94% agreement with human annotators for hypothesis selection, compared to 63% for WER, also outperforming semantic metrics. Embeddings from decoder-based LLMs show performance comparable to encoder models. Finally, LLMs offer a promising direction for interpretable and semantic ASR evaluation.
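To make the WER baseline and the agreement figures concrete, here is a minimal Python sketch. It assumes WER is the word-level edit distance divided by the number of reference words, and that agreement is the fraction of hypothesis pairs where a metric picks the same candidate as the human annotators. The function names, the tie handling, and the toy sentences are illustrative assumptions, not the paper's code or data.

```python
from typing import Sequence


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j]: edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


def pick_by_wer(reference: str, hyp_a: str, hyp_b: str) -> str:
    """Return "A", "B", or "tie" depending on which hypothesis has lower WER.

    Reporting ties explicitly is an assumption made for this sketch.
    """
    wer_a = word_error_rate(reference, hyp_a)
    wer_b = word_error_rate(reference, hyp_b)
    if wer_a < wer_b:
        return "A"
    if wer_b < wer_a:
        return "B"
    return "tie"


def agreement(metric_choices: Sequence[str], human_choices: Sequence[str]) -> float:
    """Fraction of pairs where the metric's choice matches the human choice."""
    if len(metric_choices) != len(human_choices) or not human_choices:
        raise ValueError("choice lists must be non-empty and of equal length")
    matches = sum(m == h for m, h in zip(metric_choices, human_choices))
    return matches / len(human_choices)


if __name__ == "__main__":
    # Toy example (invented): each hypothesis has exactly one word error,
    # so WER cannot separate them even though only A reverses the meaning.
    reference = "i do not like it"
    hyp_a = "i do like it"
    hyp_b = "i really do not like it"
    print(word_error_rate(reference, hyp_a), word_error_rate(reference, hyp_b))
    print(pick_by_wer(reference, hyp_a, hyp_b))
```

In the toy pair, one hypothesis drops "not" and the other inserts "really"; both count as a single word error, which illustrates why a purely lexical metric can disagree with a human who judges by meaning. An LLM-based or embedding-based selector could be plugged in wherever `pick_by_wer` is used and scored with the same `agreement` function.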
