Calibration-Reasoning Framework for Descriptive Speech Quality Assessment

Elizaveta Kostenok; Mathieu Salzmann; Milos Cernak

arXiv:2603.10175·eess.AS·March 12, 2026

Open Access

TL;DR

This paper presents a calibration and reinforcement learning framework that enhances speech quality assessment by enabling multidimensional reasoning, artifact detection, and improved accuracy over existing methods.

Contribution

It introduces a novel post-training approach combining calibration and reinforcement learning to improve perceptual dimension prediction and artifact localization in speech quality assessment.

Findings

01 Achieved a state-of-the-art mean PCC of 0.71 on the QualiSpeech benchmark.

02 Improved MOS prediction by 13% using RL-based reasoning.

03 Enhanced temporal localization and classification of audio artifacts.
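
The headline metric is a mean Pearson correlation coefficient (PCC) over perceptual dimensions. The sketch below shows one plausible way such a figure is computed: PCC per dimension, then an unweighted average. The dimension names and scores are hypothetical placeholders, not values from the paper, and the paper's exact evaluation protocol may differ.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def mean_pcc(predicted, reference):
    """Average the per-dimension PCC over all dimensions in `reference`.

    Both arguments map a dimension name to a list of per-utterance scores.
    Unweighted averaging is an assumption made for this sketch.
    """
    scores = [pearson(predicted[dim], reference[dim]) for dim in reference]
    return sum(scores) / len(scores)


# Hypothetical dimensions and ratings, for illustration only.
reference = {"noise": [1.0, 2.0, 3.0, 4.0], "distortion": [4.0, 3.0, 2.0, 1.0]}
predicted = {"noise": [1.5, 2.5, 3.5, 4.5], "distortion": [1.0, 2.0, 3.0, 4.0]}

if __name__ == "__main__":
    print(mean_pcc(predicted, reference))
```

In this toy input the first dimension is perfectly correlated and the second perfectly anti-correlated, so the two per-dimension values cancel in the average. That is a reminder that a mean PCC can hide large differences between dimensions.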

Abstract

Explainable speech quality assessment requires moving beyond Mean Opinion Scores (MOS) to analyze underlying perceptual dimensions. To address this, we introduce a novel post-training method that tailors a foundational Audio Large Language Model for multidimensional reasoning and for the detection and classification of audio artifacts. First, a calibration stage aligns the model to predict predefined perceptual dimensions. Second, a reinforcement learning stage leverages Group Relative Policy Optimization (GRPO) with dimension-specific rewards to substantially improve the accuracy of descriptions and the temporal localization of quality issues. With this approach, we reach state-of-the-art results of 0.71 mean PCC on the multidimensional QualiSpeech benchmark and a 13% improvement in MOS prediction driven by RL-based reasoning. Furthermore, our fine-grained GRPO rewards substantially advance the model's…
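
The abstract names GRPO with dimension-specific rewards but does not give the reward formulas. As a rough illustration, the sketch below shows the group-relative advantage step that characterizes GRPO in general: several responses are sampled for the same input, each receives a scalar reward, and rewards are standardized within the group. How per-dimension rewards are combined here (a plain sum over hypothetical dimension names) and the use of the population standard deviation are assumptions of this sketch, not details taken from the paper.

```python
import statistics


def total_reward(dimension_rewards):
    """Combine per-dimension rewards into one scalar.

    A plain sum is an assumption; the paper's actual weighting is not
    stated in the abstract.
    """
    return sum(dimension_rewards.values())


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of sampled responses.

    Each advantage is (reward - group mean) / (group std + eps), so a
    response is scored relative to its siblings rather than against a
    learned value function.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std, an assumption
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical per-dimension rewards for three sampled responses.
group = [
    {"noise": 1.0, "distortion": 0.5, "localization": 0.0},
    {"noise": 0.5, "distortion": 0.5, "localization": 1.0},
    {"noise": 0.0, "distortion": 0.0, "localization": 0.0},
]

if __name__ == "__main__":
    rewards = [total_reward(r) for r in group]
    print(group_relative_advantages(rewards))
```

Responses whose combined reward exceeds the group mean get a positive advantage and are reinforced; those below it get a negative one. If every response in a group earns the same reward, all advantages are zero and that group contributes no learning signal.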

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis