Word Error Rate Definitions and Algorithms for Long-Form Multi-talker Speech Recognition

Thilo von Neumann; Christoph Boeddeker; Marc Delcroix; Reinhold Haeb-Umbach

arXiv:2508.02112·eess.AS·August 5, 2025

TL;DR

This paper reviews and compares Word Error Rate (WER) metrics designed for long-form multi-talker speech recognition and proposes new algorithms for computing them, addressing speaker confusion and computational efficiency.

Contribution

It provides a unified description of existing WERs, introduces the DI-cpWER to measure speaker confusion impact, and offers efficient algorithms for complex WER computations.

Findings

01 Unified description of WER variants for multi-talker speech

02 Introduction of DI-cpWER to isolate speaker confusion errors

03 Greedy algorithms achieve high-precision approximations with polynomial complexity

Abstract

The predominant metric for evaluating speech recognizers, the Word Error Rate (WER), has been extended in different ways to handle transcripts produced by long-form multi-talker speech recognizers. These systems process long transcripts containing multiple speakers and complex speaking patterns, so the classical WER cannot be applied. There are speaker-attributed approaches that count speaker confusion errors, such as the concatenated minimum-permutation WER (cpWER) and the time-constrained cpWER (tcpWER), and speaker-agnostic approaches, which aim to ignore speaker confusion errors, such as the Optimal Reference Combination WER (ORC-WER) and the MIMO-WER. These WERs evaluate different aspects and error types (e.g., temporal misalignment). A detailed comparison has not been made. We therefore present a unified description of the existing WERs and highlight when to use which metric. To…
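To make the speaker-attributed idea concrete, the following is a minimal sketch of cpWER as it is commonly defined: concatenate each speaker's utterances, try every assignment of hypothesis speakers to reference speakers, and keep the assignment with the fewest word errors. The function names `edit_distance` and `cp_wer` and the dictionary input format are assumptions made for illustration, and the brute-force permutation search is not one of the efficient algorithms the paper proposes; it only scales to a handful of speakers.

```python
from itertools import permutations


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference word
                curr[j - 1] + 1,         # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution, or match at cost 0
            ))
        prev = curr
    return prev[-1]


def cp_wer(ref_by_speaker, hyp_by_speaker):
    """Brute-force cpWER sketch (hypothetical helper, not the paper's algorithm).

    Both arguments map a speaker label to that speaker's utterances
    (strings) in temporal order.
    """
    # Concatenate every speaker's utterances into one word sequence.
    refs = [" ".join(utts).split() for utts in ref_by_speaker.values()]
    hyps = [" ".join(utts).split() for utts in hyp_by_speaker.values()]

    # Pad with empty streams so both sides have the same number of speakers.
    n = max(len(refs), len(hyps))
    refs += [[] for _ in range(n - len(refs))]
    hyps += [[] for _ in range(n - len(hyps))]

    total_ref_words = sum(len(r) for r in refs)
    if total_ref_words == 0:
        raise ValueError("reference contains no words")

    # Minimum total edit distance over all speaker permutations.
    best = min(
        sum(edit_distance(r, h) for r, h in zip(refs, perm))
        for perm in permutations(hyps)
    )
    return best / total_ref_words


reference = {"A": ["hello world", "how are you"], "B": ["fine thanks"]}
correct = {"spk1": ["fine thanks"], "spk2": ["hello world how are you"]}
confused = {"spk1": ["fine thanks how are you"], "spk2": ["hello world"]}

print(cp_wer(reference, correct))   # all words and speakers correct
print(cp_wer(reference, confused))  # "how are you" attributed to the wrong speaker
```

In the second call every word is recognized correctly, yet the score is nonzero because part of one speaker's speech is attributed to the other. This is the kind of speaker confusion error that speaker-attributed metrics count and speaker-agnostic metrics aim to ignore.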
