asr_eval: Algorithms and tools for multi-reference and streaming speech recognition evaluation
Oleg Sedukhin, Andrey Kostin

TL;DR
This paper introduces new algorithms and tools for evaluating speech recognition, especially for non-Latin languages and longform speech, including a novel alignment method, a new Russian speech dataset, and visualization tools.
Contribution
It presents a string alignment algorithm supporting multi-reference labeling and long insertions, along with a new Russian speech dataset and evaluation tools for streaming recognition.
Findings
Improved word alignment supports non-Latin languages.
New Russian speech dataset with multi-reference labeling.
Tools for visual comparison of transcriptions.
Abstract
We propose several improvements to speech recognition evaluation. First, we propose a string alignment algorithm that supports multi-reference labeling, arbitrary-length insertions, and better word alignment. This is especially useful for non-Latin languages, for languages with rich word formation, and for labeling cluttered or longform speech. Second, we collect a novel test set, DiverseSpeech-Ru, of longform in-the-wild Russian speech with careful multi-reference labeling. We also perform multi-reference relabeling of popular Russian test sets and study fine-tuning dynamics on their corresponding train sets. We demonstrate that the model often adapts to dataset-specific labeling, causing an illusion of metric improvement. Based on the improved word alignment, we develop tools to evaluate streaming speech recognition and to align multiple transcriptions to compare them visually. Additionally, we…
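To make the idea of multi-reference labeling concrete, here is a minimal sketch of how a word-level edit distance can be computed when the reference contains alternative spellings or optional words. It is not the authors' algorithm: the segment representation (a list of segments, each holding a list of alternative word tuples) and the function names `advance` and `multi_ref_distance` are assumptions made for illustration, and the sketch does not cover arbitrary-length insertions or the improved word alignment described in the abstract.

```python
def advance(row, word, hyp):
    """Consume one reference word: a standard Levenshtein row update.

    row[j] is the cheapest cost of aligning the reference consumed so far
    with the first j hypothesis words.
    """
    new = [row[0] + 1]  # reference word deleted, no hypothesis words used
    for j in range(1, len(hyp) + 1):
        substitute = row[j - 1] + (hyp[j - 1] != word)  # match costs 0
        delete = row[j] + 1       # reference word missing in hypothesis
        insert = new[j - 1] + 1   # extra hypothesis word
        new.append(min(substitute, delete, insert))
    return new


def multi_ref_distance(segments, hyp):
    """Minimum word edit distance over all choices of alternatives.

    segments: list of segments; each segment is a list of alternatives;
              each alternative is a tuple of words (may be empty, meaning
              the segment is optional). This layout is hypothetical.
    hyp:      list of hypothesis words.
    """
    # Empty reference against each hypothesis prefix: insertions only.
    row = list(range(len(hyp) + 1))
    for alternatives in segments:
        candidates = []
        for alt in alternatives:
            r = row
            for word in alt:
                r = advance(r, word, hyp)
            candidates.append(r)
        # Keep, for every hypothesis prefix, the cheapest alternative.
        row = [min(column) for column in zip(*candidates)]
    return row[-1]


# Hypothetical reference: optional filler, a number written two ways, a noun.
reference = [
    [("uh",), ()],
    [("twenty", "one"), ("21",)],
    [("cats",)],
]

print(multi_ref_distance(reference, ["21", "cats"]))
print(multi_ref_distance(reference, ["twenty", "one", "cat"]))
```

Under this scheme a hypothesis is not penalized for choosing any of the accepted variants, so "21 cats" scores zero errors while "twenty one cat" scores one substitution. Taking the element-wise minimum after each segment keeps the cost linear in the number of alternatives rather than enumerating every combination of variants.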
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Face recognition and analysis
