asr_eval: Algorithms and tools for multi-reference and streaming speech recognition evaluation

Oleg Sedukhin; Andrey Kostin

arXiv:2601.20992 · cs.CL · January 30, 2026

TL;DR

This paper introduces new algorithms and tools for evaluating speech recognition, especially for non-Latin languages and longform speech, including a novel alignment method, a new Russian speech dataset, and visualization tools.

Contribution

It presents a string alignment algorithm supporting multi-reference labeling and long insertions, along with a new Russian speech dataset and evaluation tools for streaming recognition.

Findings

01. Improved word alignment supports non-Latin languages.

02. New Russian speech dataset with multi-reference labeling.

03. Tools for visual comparison of transcriptions.

Abstract

We propose several improvements to speech recognition evaluation. First, we propose a string alignment algorithm that supports multi-reference labeling, arbitrary-length insertions, and better word alignment. This is especially useful for non-Latin languages, for languages with rich word formation, and for labeling cluttered or longform speech. Second, we collect a novel test set, DiverseSpeech-Ru, of longform in-the-wild Russian speech with careful multi-reference labeling. We also perform multi-reference relabeling of popular Russian test sets and study fine-tuning dynamics on the corresponding train sets. We demonstrate that the model often adapts to dataset-specific labeling, causing an illusion of metric improvement. Based on the improved word alignment, we develop tools to evaluate streaming speech recognition and to align multiple transcriptions to compare them visually. Additionally, we…

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Face recognition and analysis