Fine-Tuning Text-to-Speech Diffusion Models Using Reinforcement Learning with Human Feedback

Jingyi Chen; Ju Seung Byun; Micha Elsner; Pichao Wang; Andrew Perrault

arXiv:2508.03123 · cs.SD · August 6, 2025

TL;DR

This paper introduces DLPO, a reinforcement learning framework that enhances diffusion-based text-to-speech models by improving naturalness and efficiency, making real-time high-quality speech synthesis feasible.

Contribution

The paper presents DLPO, a novel RLHF method that integrates diffusion model loss into reward optimization, significantly improving speech quality and efficiency in TTS diffusion models.

Findings

01 Achieved higher objective speech quality metrics (UTMOS 3.65, NISQA 4.02)

02 DLPO preferred in 67% of subjective evaluations

03 Enhanced real-time speech synthesis performance

Abstract

Diffusion models produce high-fidelity speech but are inefficient for real-time use due to long denoising steps and challenges in modeling intonation and rhythm. To improve this, we propose Diffusion Loss-Guided Policy Optimization (DLPO), an RLHF framework for TTS diffusion models. DLPO integrates the original training loss into the reward function, preserving generative capabilities while reducing inefficiencies. Using naturalness scores as feedback, DLPO aligns reward optimization with the diffusion model's structure, improving speech quality. We evaluate DLPO on WaveGrad 2, a non-autoregressive diffusion-based TTS model. Results show significant improvements in objective metrics (UTMOS 3.65, NISQA 4.02) and subjective evaluations, with DLPO audio preferred 67% of the time. These findings demonstrate DLPO's potential for efficient, high-quality diffusion TTS in real-time,…
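The idea of folding the diffusion training loss into the reward can be illustrated with a small sketch. The abstract says only that the original training loss is integrated into the reward function. The specific form used here is an assumption made for illustration and is not taken from the paper: the loss is subtracted from the naturalness reward with a hypothetical weight `beta`, and the result weights a REINFORCE-style log-probability term. The function names are hypothetical as well.

```python
import numpy as np

def diffusion_loss(pred_noise, true_noise):
    """Per-sample mean squared error between predicted and true noise."""
    return np.mean((pred_noise - true_noise) ** 2, axis=-1)

def loss_shaped_objective(reward, pred_noise, true_noise, log_prob, beta=1.0):
    """Scalar surrogate to minimize (illustrative form, an assumption).

    reward     : naturalness score per sample, shape (batch,)
    pred_noise : model noise prediction, shape (batch, dim)
    true_noise : noise actually added, shape (batch, dim)
    log_prob   : log-probability of the sampled denoising step, shape (batch,)
    beta       : hypothetical weight on the diffusion loss penalty
    """
    # Penalize the reward by the diffusion loss so that reward maximization
    # does not pull the model away from its original training objective.
    shaped_reward = reward - beta * diffusion_loss(pred_noise, true_noise)
    # Policy-gradient surrogate: raise the log-probability of high-reward samples.
    return -np.mean(shaped_reward * log_prob)

if __name__ == "__main__":
    reward = np.array([3.0, 4.0])
    pred = np.array([[0.0, 0.0], [1.0, 1.0]])
    true = np.zeros((2, 2))
    log_prob = np.array([-1.0, -2.0])
    print(loss_shaped_objective(reward, pred, true, log_prob, beta=0.5))
```

In an actual training loop the shaped reward would be treated as a constant and gradients would flow through `log_prob` only; NumPy is used here just to show the arithmetic.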
