DLPO: Diffusion Model Loss-Guided Reinforcement Learning for Fine-Tuning Text-to-Speech Diffusion Models

Jingyi Chen; Ju-Seung Byun; Micha Elsner; Andrew Perrault

arXiv:2405.14632 · cs.LG · November 19, 2024

TL;DR

This paper introduces DLPO, a reinforcement learning method guided by diffusion model loss, to improve the quality and naturalness of diffusion-based text-to-speech synthesis, demonstrating its effectiveness through objective and human evaluations.

Contribution

The paper presents DLPO, a novel RL policy optimization technique guided by diffusion model loss, specifically designed for fine-tuning speech synthesis models.
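
As a rough illustration of what "guided by diffusion model loss" could mean in practice, the sketch below combines a scalar quality reward (for example, a predicted MOS) with a per-step diffusion-loss penalty inside a REINFORCE-style surrogate. This is a minimal sketch under stated assumptions, not the paper's objective: the function names `shaped_reward` and `surrogate_loss`, the weights `alpha` and `beta`, and the way the terms are combined are all hypothetical and chosen only for illustration.

```python
from typing import Sequence


def shaped_reward(mos_score: float, diffusion_loss: float,
                  alpha: float = 1.0, beta: float = 1.0) -> float:
    """Hypothetical shaped reward: weighted quality score minus a
    weighted diffusion-loss penalty. alpha and beta are assumed knobs."""
    return alpha * mos_score - beta * diffusion_loss


def surrogate_loss(log_probs: Sequence[float], mos_score: float,
                   diffusion_losses: Sequence[float],
                   alpha: float = 1.0, beta: float = 1.0) -> float:
    """REINFORCE-style surrogate averaged over one denoising trajectory.

    log_probs[i] is the log-probability the policy assigned to the i-th
    denoising step; diffusion_losses[i] is the diffusion loss measured at
    that step. Minimizing the returned value raises the probability of
    steps whose shaped reward is high.
    """
    if len(log_probs) != len(diffusion_losses):
        raise ValueError("log_probs and diffusion_losses must align")
    if not log_probs:
        raise ValueError("trajectory must contain at least one step")
    total = 0.0
    for log_prob, step_loss in zip(log_probs, diffusion_losses):
        total += -shaped_reward(mos_score, step_loss, alpha, beta) * log_prob
    return total / len(log_probs)


if __name__ == "__main__":
    # Toy numbers only; they are not results from the paper.
    print(surrogate_loss([-1.0, -2.0], 4.0, [0.5, 1.0], alpha=1.0, beta=2.0))
```

In a real training loop the log-probabilities and losses would be differentiable tensors rather than floats; plain floats are used here only to keep the arithmetic easy to follow.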

Findings

01 RLHF improves diffusion-based speech synthesis quality

02 DLPO outperforms other RLHF methods in naturalness and quality

03 Enhanced speech naturalness confirmed by human preference tests

Abstract

Recent advancements in generative models have sparked significant interest within the machine learning community. In particular, diffusion models have demonstrated remarkable capabilities in synthesizing images and speech. Studies by Lee et al. (2023), Black et al. (2023), Wang et al. (2023), and Fan et al. (2024) show that Reinforcement Learning with Human Feedback (RLHF) can enhance diffusion models for image synthesis. However, because these models differ architecturally from those used in speech synthesis, it remains uncertain whether RLHF could similarly benefit speech synthesis models. In this paper, we explore the practical application of RLHF to diffusion-based text-to-speech synthesis, using the mean opinion score (MOS) predicted by the UTokyo-SaruLab MOS prediction system (Saeki et al., 2022) as a proxy loss. We introduce diffusion model…

Taxonomy

Topics: Speech Recognition and Synthesis

Methods: Diffusion