Improving Speech Enhancement with Multi-Metric Supervision from Learned Quality Assessment

Wei Wang; Wangyou Zhang; Chenda Li; Jiatong Shi; Shinji Watanabe; Yanmin Qian

arXiv:2506.12260·cs.SD·August 25, 2025

TL;DR

This paper introduces a speech enhancement training method that uses a learned speech quality assessment model predicting multiple metrics, leading to improved perceptual quality and better generalization compared to traditional objectives.

Contribution

The work proposes a novel SQA-guided training framework for speech enhancement that leverages multi-metric supervision, addressing limitations of conventional objectives and enabling training on real-world data.
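
As a rough illustration of what multi-metric supervision could look like, the sketch below turns several predicted quality scores into a single loss value. It is not the paper's formulation: the function name multi_metric_loss, the metric names, the score ranges, and the equal-weight averaging are all assumptions made for illustration, and the scores are assumed to come from some frozen SQA model that is not shown here.

```python
from typing import Dict, Tuple


def multi_metric_loss(
    predicted: Dict[str, float],
    ranges: Dict[str, Tuple[float, float]],
) -> float:
    """Combine predicted quality scores into one loss in [0, 1].

    Hypothetical helper: each score is rescaled to [0, 1] using its
    (low, high) range, the rescaled scores are averaged with equal
    weight, and the loss is one minus that average, so higher
    predicted quality gives a lower loss.
    """
    if not predicted:
        raise ValueError("at least one predicted metric is required")
    total = 0.0
    for name, score in predicted.items():
        low, high = ranges[name]
        normalized = (score - low) / (high - low)
        # Keep out-of-range predictions from dominating the average.
        total += min(1.0, max(0.0, normalized))
    return 1.0 - total / len(predicted)


# Placeholder metric names and ranges, chosen only for this example.
scores = {"metric_a": 3.0, "metric_b": 2.0}
score_ranges = {"metric_a": (1.0, 5.0), "metric_b": (1.0, 3.0)}
loss = multi_metric_loss(scores, score_ranges)  # 0.5
```

In actual training the same idea would be written in a differentiable framework so that gradients flow from the predicted scores back into the enhancement model. No clean reference signal appears anywhere in this computation, which is what makes training on real recordings possible in principle.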

Findings

01. SQA-guided training improves speech quality across multiple metrics.

02. The approach generalizes better to real-world data.

03. It outperforms traditional training objectives like SI-SNR.

Abstract

Speech quality assessment (SQA) aims to predict the perceived quality of speech signals under a wide range of distortions. It is inherently connected to speech enhancement (SE), which seeks to improve speech quality by removing unwanted signal components. While SQA models are widely used to evaluate SE performance, their potential to guide SE training remains underexplored. In this work, we investigate a training framework that leverages an SQA model, trained to predict multiple evaluation metrics from a public SE leaderboard, as a supervisory signal for SE. This approach addresses a key limitation of conventional SE objectives, such as SI-SNR, which often fail to align with perceptual quality and generalize poorly across evaluation metrics. Moreover, it enables training on real-world data where clean references are unavailable. Experiments on both simulated and real-world test sets show…
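
For readers unfamiliar with the conventional objective named above, here is a minimal pure-Python sketch of scale-invariant signal-to-noise ratio (SI-SNR) in its standard textbook form. The function name si_snr and the eps stabilizer are choices made for this example, not code from the paper. Note that it needs a clean reference signal, which is exactly what real-world recordings lack.

```python
import math
from typing import Sequence


def si_snr(estimate: Sequence[float], reference: Sequence[float],
           eps: float = 1e-8) -> float:
    """Scale-invariant SNR in dB between an estimate and a clean reference."""
    if len(estimate) != len(reference) or len(reference) == 0:
        raise ValueError("signals must be non-empty and of equal length")
    n = len(reference)
    # Remove the mean from both signals.
    est_mean = sum(estimate) / n
    ref_mean = sum(reference) / n
    est = [x - est_mean for x in estimate]
    ref = [x - ref_mean for x in reference]
    # Project the estimate onto the reference to get the target component.
    dot = sum(e * r for e, r in zip(est, ref))
    ref_energy = sum(r * r for r in ref)
    scale = dot / (ref_energy + eps)
    target = [scale * r for r in ref]
    # Whatever is left after the projection is treated as noise.
    noise = [e - t for e, t in zip(est, target)]
    target_energy = sum(t * t for t in target)
    noise_energy = sum(v * v for v in noise)
    return 10.0 * math.log10((target_energy + eps) / (noise_energy + eps))


# Synthetic example: a sinusoid with a small additive disturbance.
clean = [math.sin(0.1 * i) for i in range(200)]
noisy = [c + 0.1 * math.cos(0.37 * i) for i, c in enumerate(clean)]
score_db = si_snr(noisy, clean)
```

When used as a training loss, the negative of this value is minimized. Rescaling the estimate leaves the score essentially unchanged, which is the "scale-invariant" property.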

Taxonomy

Topics: Speech and Audio Processing