Using RLHF to align speech enhancement approaches to mean-opinion quality scores

Anurag Kumar; Andrew Perrault; Donald S. Williamson

arXiv:2410.13182 · eess.AS · October 18, 2024

TL;DR

This paper introduces a reinforcement learning from human feedback (RLHF) framework that fine-tunes an existing speech enhancement model to align it more closely with human subjective quality ratings, improving both objective and MOS-based quality metrics on the Voicebank+DEMAND dataset.

Contribution

The study presents a novel RLHF-based fine-tuning method that optimizes speech enhancement models using MOS-based rewards, addressing the misalignment between traditional objective measures and human subjective ratings.

Findings

1. RLHF-finetuned model outperforms baselines on multiple benchmarks

2. Both policy gradient and MSE losses are crucial for balanced optimization

3. Improved correlation with human subjective ratings

Abstract

Objective speech quality measures are typically used to assess speech enhancement algorithms, but it has been shown that they are sub-optimal as learning objectives because they do not always align well with human subjective ratings. This misalignment often results in noticeable distortions and artifacts that cause speech enhancement to be ineffective. To address these issues, we propose a reinforcement learning from human feedback (RLHF) framework to fine-tune an existing speech enhancement approach by optimizing performance using a mean-opinion score (MOS)-based reward model. Our results show that the RLHF-finetuned model has the best performance across different benchmarks for both objective and MOS-based speech quality assessment metrics on the Voicebank+DEMAND dataset. Through ablation studies, we show that both policy gradient loss and supervised MSE loss are important for…
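
The abstract names two loss terms, a policy gradient loss driven by a MOS-based reward and a supervised MSE loss, without giving their exact form on this page. The following is a minimal sketch of how such a combination might look, assuming a REINFORCE-style estimator with a fixed-variance Gaussian policy. Every name and parameter here (`combined_loss`, `baseline`, `std`, `pg_weight`, `mse_weight`) is hypothetical and is not taken from the paper; `reward` stands in for the score a MOS-based reward model would return.

```python
import math


def gaussian_log_prob(action, mean, std):
    """Log-density of a fixed-variance diagonal Gaussian, summed over dimensions."""
    return sum(
        -0.5 * ((a - m) / std) ** 2 - math.log(std) - 0.5 * math.log(2 * math.pi)
        for a, m in zip(action, mean)
    )


def mse(pred, target):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def combined_loss(sampled, mean, clean, reward, baseline,
                  std=0.1, pg_weight=1.0, mse_weight=1.0):
    """Weighted sum of a REINFORCE-style loss and a supervised MSE loss.

    sampled  : enhanced signal drawn from the policy (assumed Gaussian)
    mean     : deterministic output of the enhancement model
    clean    : reference clean signal
    reward   : scalar score from a MOS-based reward model (assumed given)
    baseline : scalar subtracted from the reward to reduce variance
    """
    advantage = reward - baseline
    # Minimizing this term raises the likelihood of samples whose reward
    # exceeds the baseline and lowers it for samples below the baseline.
    pg_loss = -advantage * gaussian_log_prob(sampled, mean, std)
    # The supervised term keeps the model output close to the clean reference.
    sup_loss = mse(mean, clean)
    return pg_weight * pg_loss + mse_weight * sup_loss


if __name__ == "__main__":
    loss = combined_loss(
        sampled=[0.1, 0.9], mean=[0.0, 1.0], clean=[0.0, 0.8],
        reward=3.6, baseline=3.2,
    )
    print(loss)
```

In this sketch the MSE term acts as an anchor to the clean reference while the reward term shifts the model toward outputs the reward model scores highly, which is one plausible reading of why the ablation finds both terms important.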

Taxonomy

Topics: Speech and Audio Processing

Methods: ALIGN