Towards Frame-level Quality Predictions of Synthetic Speech
Michael Kuhlmann, Fritz Seebauer, Petra Wagner, Reinhold Haeb-Umbach

TL;DR
This paper explores the feasibility of automatic frame-level speech quality prediction, aiming to improve explainability in speech synthesis assessment by proposing criteria and evaluating predictors against human annotations.
Contribution
It identifies issues in existing quality predictors that prevent sensible frame-level prediction, defines criteria that a frame-level predictor should fulfill, and proposes chunk-based processing that keeps a localized distortion from affecting the scores of neighboring frames.
Findings
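In experiments with localized artificial distortions, frame-level predictors can outperform the detection performance of crowd-sourced human annotations.
Chunk-based processing avoids the impact of a localized distortion on the scores of neighboring frames.
Existing quality predictors have issues that prevent sensible frame-level prediction; this work identifies them as a first step.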
Abstract
While automatic subjective speech quality assessment has witnessed much progress, an open question is whether an automatic quality assessment at frame resolution is possible. This would be highly desirable, as it adds explainability to the assessment of speech synthesis systems. Here, we take first steps towards this goal by identifying issues of existing quality predictors that prevent sensible frame-level prediction. Further, we define criteria that a frame-level predictor should fulfill. We also suggest chunk-based processing that avoids the impact of a localized distortion on the score of neighboring frames. Finally, in experiments with localized artificial distortions, we measure the localization performance of a set of frame-level quality predictors and show that they can outperform the detection performance of human annotations obtained from a crowd-sourced perception experiment.
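To make the chunk-based idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes non-overlapping chunks and a hypothetical `toy_predictor` whose per-frame score depends on the whole input it sees, which mimics how a predictor with utterance-level context can let one localized distortion leak into the scores of distant frames. The frame length, chunk length, and both function names are illustrative assumptions; the summary above does not specify them.

```python
import numpy as np


def toy_predictor(segment, frame_len):
    """Hypothetical stand-in for a frame-level quality predictor.

    Returns one score per frame (higher = better). The global mean term
    deliberately couples every frame to the whole input it is given.
    """
    n_frames = len(segment) // frame_len
    frames = segment[: n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames ** 2).mean(axis=1)
    return -(energy + energy.mean())


def chunked_frame_scores(wave, predictor, frame_len, chunk_len):
    """Run the predictor on each chunk independently and concatenate scores.

    Assumption: chunks do not overlap and chunk_len is a multiple of
    frame_len, so frames never straddle a chunk boundary.
    """
    if chunk_len % frame_len != 0:
        raise ValueError("chunk_len must be a multiple of frame_len")
    n_frames = len(wave) // frame_len
    wave = wave[: n_frames * frame_len]  # drop a trailing partial frame
    pieces = [
        predictor(wave[start : start + chunk_len], frame_len)
        for start in range(0, len(wave), chunk_len)
    ]
    return np.concatenate(pieces) if pieces else np.empty(0)


# Silent signal with one artificial distortion confined to frame 5.
frame_len, chunk_len = 100, 400
wave = np.zeros(1600)
wave[500:600] = 1.0

full = toy_predictor(wave, frame_len)  # whole utterance at once
chunked = chunked_frame_scores(wave, toy_predictor, frame_len, chunk_len)
```

In this toy setup, `full` is lowered for every frame because the distortion enters the global term, whereas `chunked` changes only for the four frames of the chunk that contains the distortion. A smaller chunk narrows the affected region but also gives the predictor less context per call, so the chunk length is a trade-off that would need to be tuned for a real predictor.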
