Towards Frame-level Quality Predictions of Synthetic Speech

Michael Kuhlmann; Fritz Seebauer; Petra Wagner; Reinhold Haeb-Umbach

arXiv:2508.10374·eess.AS·October 10, 2025·Interspeech

TL;DR

This paper explores the feasibility of automatic frame-level speech quality prediction, aiming to improve explainability in speech synthesis assessment by proposing criteria and evaluating predictors against human annotations.

Contribution

The paper identifies issues in existing quality predictors that prevent sensible frame-level prediction, defines criteria that a frame-level predictor should fulfill, and proposes chunk-based processing that keeps a localized distortion from affecting the scores of neighboring frames.

Findings

01

In experiments with localized artificial distortions, frame-level predictors can outperform the detection performance of crowd-sourced human annotations.

02

Chunk-based processing prevents a localized distortion from influencing the scores of neighboring frames.

03

Existing quality predictors have issues that prevent sensible frame-level prediction; identifying them is a first step towards frame-level assessment.

Abstract

While automatic subjective speech quality assessment has witnessed much progress, an open question is whether an automatic quality assessment at frame resolution is possible. This would be highly desirable, as it adds explainability to the assessment of speech synthesis systems. Here, we take first steps towards this goal by identifying issues of existing quality predictors that prevent sensible frame-level prediction. Further, we define criteria that a frame-level predictor should fulfill. We also suggest a chunk-based processing that avoids the impact of a localized distortion on the score of neighboring frames. Finally, we measure in experiments with localized artificial distortions the localization performance of a set of frame-level quality predictors and show that they can outperform detection performance of human annotations obtained from a crowd-sourced perception experiment.
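
The chunk-based idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the chunk length, the use of non-overlapping chunks, and the `toy_quality` scorer (a hypothetical stand-in for a learned quality predictor) are all assumptions made for illustration. The sketch only shows the property the abstract describes: when each chunk is scored independently, a localized distortion changes the score of its own chunk and leaves the neighboring scores untouched.

```python
from typing import Callable, List, Sequence


def chunked_scores(
    signal: Sequence[float],
    chunk_size: int,
    score_chunk: Callable[[Sequence[float]], float],
) -> List[float]:
    """Split the signal into non-overlapping chunks and score each one
    independently, so no chunk sees samples from its neighbors."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    scores = []
    for start in range(0, len(signal), chunk_size):
        chunk = signal[start:start + chunk_size]
        scores.append(score_chunk(chunk))
    return scores


def toy_quality(chunk: Sequence[float]) -> float:
    # Hypothetical scorer (assumption, not from the paper):
    # higher is better, and large peak amplitudes are penalized.
    return -max(abs(x) for x in chunk)


# A clean toy signal and a copy with one localized artificial distortion.
clean = [0.1] * 12
distorted = list(clean)
distorted[5] = 1.0  # falls into the second chunk (samples 4..7)

clean_scores = chunked_scores(clean, 4, toy_quality)
distorted_scores = chunked_scores(distorted, 4, toy_quality)

# Indices of chunks whose score changed because of the distortion.
changed = [
    i for i, (a, b) in enumerate(zip(clean_scores, distorted_scores)) if a != b
]
print(changed)
```

With a predictor that processes the whole utterance at once, the distortion at sample 5 could also shift the scores of surrounding frames through the model's temporal context; scoring chunks in isolation rules that out by construction, at the cost of giving the predictor less context per score.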
