Word Error Rate Definitions and Algorithms for Long-Form Multi-talker Speech Recognition
Thilo von Neumann, Christoph Boeddeker, Marc Delcroix, Reinhold Haeb-Umbach

TL;DR
This paper reviews and compares Word Error Rate metrics tailored for long-form multi-talker speech recognition and proposes new algorithms for computing them, addressing speaker confusion and computational efficiency.
Contribution
It provides a unified description of existing WERs, introduces the DI-cpWER to measure speaker confusion impact, and offers efficient algorithms for complex WER computations.
Findings
Unified description of WER variants for multi-talker speech
Introduction of DI-cpWER to isolate speaker confusion errors
Greedy algorithms achieve high-precision approximations with polynomial complexity
Abstract
The predominant metric for evaluating speech recognizers, the Word Error Rate (WER), has been extended in different ways to handle transcripts produced by long-form multi-talker speech recognizers. These systems process long transcripts containing multiple speakers and complex speaking patterns, so the classical WER cannot be applied. There are speaker-attributed approaches that count speaker confusion errors, such as the concatenated minimum-permutation WER (cpWER) and the time-constrained cpWER (tcpWER), and speaker-agnostic approaches, which aim to ignore speaker confusion errors, such as the Optimal Reference Combination WER (ORC-WER) and the MIMO-WER. These WERs evaluate different aspects and error types (e.g., temporal misalignment). A detailed comparison has not been made. We therefore present a unified description of the existing WERs and highlight when to use which metric. To…
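To make the speaker-attributed idea concrete, the following is a minimal brute-force sketch of the cpWER as it is commonly defined: concatenate each speaker's utterances, try every mapping between reference and hypothesis speakers, and keep the mapping with the fewest word errors. It is an illustration only, not the paper's algorithm. The function names `edit_distance` and `cp_wer`, the dict-of-utterances input format, and the padding of missing speakers with empty streams are assumptions made here, and the exhaustive search over permutations is factorial in the number of speakers, unlike the efficient algorithms the paper describes.

```python
from itertools import permutations


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cp_wer(reference, hypothesis):
    """Brute-force cpWER sketch (hypothetical helper, not a library API).

    reference, hypothesis: dict mapping a speaker label to a list of
    utterance strings in temporal order.
    """
    # Concatenate all utterances of each speaker into one word stream.
    refs = [" ".join(utts).split() for utts in reference.values()]
    hyps = [" ".join(utts).split() for utts in hypothesis.values()]

    # Assumption: pad the shorter side with empty streams so that every
    # stream has a partner when speaker counts differ.
    n = max(len(refs), len(hyps))
    refs += [[] for _ in range(n - len(refs))]
    hyps += [[] for _ in range(n - len(hyps))]

    total_ref_words = sum(len(r) for r in refs)
    if total_ref_words == 0:
        raise ValueError("reference contains no words")

    # Try every speaker mapping and keep the one with the fewest errors.
    best_errors = min(
        sum(edit_distance(r, hyps[p]) for r, p in zip(refs, perm))
        for perm in permutations(range(n))
    )
    return best_errors / total_ref_words


reference = {"A": ["hello world", "how are you"], "B": ["fine thanks"]}
perfect = {"spk1": ["fine thanks"], "spk2": ["hello world how are you"]}
confused = {"spk1": ["fine thanks how are you"], "spk2": ["hello world"]}

print(cp_wer(reference, perfect))
print(cp_wer(reference, confused))
```

In the `confused` hypothesis every word is recognized correctly, but one utterance is attributed to the wrong speaker, so the cpWER is nonzero. This is the kind of speaker confusion error that speaker-attributed metrics count and speaker-agnostic metrics aim to ignore.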
