P.808 Multilingual Speech Enhancement Testing: Approach and Results of URGENT 2025 Challenge

Marvin Sach; Yihui Fu; Kohei Saijo; Wangyou Zhang; Samuele Cornell; Robin Scheibler; Chenda Li; Anurag Kumar; Wei Wang; Yanmin Qian; Shinji Watanabe; Tim Fingscheidt

arXiv:2507.11306 · eess.AS · July 28, 2025

TL;DR

This paper reviews the ITU-T P.808 subjective testing method, proposes localization improvements for multilingual speech enhancement evaluation, analyzes URGENT 2025 Challenge results, and discusses the reliability of subjective metrics in the era of generative AI.

Contribution

It introduces a novel process for localizing text and audio in crowdsourced subjective tests and provides insights into the reliability of subjective metrics for generative speech enhancement methods.

Findings

01 Subjective (ACR MOS) and objective metrics should be complemented with phone fidelity metrics for generative SE.

02 Analysis of URGENT Challenge results reveals issues with current subjective testing reliability.

03 Localization scripts and methods will be released for multilingual speech enhancement evaluation.

Abstract

In speech quality estimation for speech enhancement (SE) systems, subjective listening tests have so far been considered the gold standard. This should hold even more given the large influx of new generative or hybrid methods into the field, which has revealed issues with some objective metrics. Efforts such as the Interspeech 2025 URGENT Speech Enhancement Challenge, which also involves non-English datasets, add the aspect of multilinguality to the testing procedure. In this paper, we provide a brief recap of the ITU-T P.808 crowdsourced subjective listening test method. A first novel contribution is our proposed process of localizing both the text and audio components of Naderi and Cutler's implementation of crowdsourced subjective absolute category rating (ACR) listening tests involving text-to-speech (TTS). Further, we provide surprising analyses of and insights into URGENT Challenge results,…
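For readers unfamiliar with ACR scoring, the following is a minimal sketch of how a mean opinion score (MOS) is typically derived from ratings on the five-point ACR scale (1 = bad, 5 = excellent). It is not taken from the paper or from Naderi and Cutler's toolkit: the function name `acr_mos`, the example votes, and the normal-approximation confidence interval are assumptions made for illustration.

```python
import math
import statistics


def acr_mos(ratings):
    """Return (MOS, 95% CI half-width) for ACR ratings on the 1..5 scale.

    Hypothetical helper for illustration; not part of any P.808 toolkit.
    """
    if not ratings:
        raise ValueError("need at least one rating")
    if any(r not in (1, 2, 3, 4, 5) for r in ratings):
        raise ValueError("ACR ratings must be integers from 1 to 5")

    # The MOS is the arithmetic mean of the individual votes.
    mos = statistics.mean(ratings)
    if len(ratings) < 2:
        # A single vote gives no spread estimate.
        return mos, float("nan")

    # Assumption: normal approximation (z = 1.96) with the sample
    # standard deviation; a t-quantile would be wider for few votes.
    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width


# Made-up votes for one processed clip, for demonstration only.
votes = [4, 5, 3, 4, 4, 2, 5, 4]
mos, ci = acr_mos(votes)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

In a crowdsourced test, votes would usually be screened first (for example with qualification checks and trap questions) before being averaged per clip or per system; that screening is omitted here.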

Taxonomy

Topics: Speech and Audio Processing · Hearing Loss and Rehabilitation · Speech Recognition and Synthesis