DMOSpeech 2: Reinforcement Learning for Duration Prediction in Metric-Optimized Speech Synthesis

Yinghao Aaron Li; Xilin Jiang; Fei Tao; Cheng Niu; Kaifeng Xu; Juntong Song; Nima Mesgarani

arXiv:2507.14988 · eess.AS · July 22, 2025 · AAAI

TL;DR

DMOSpeech 2 applies reinforcement learning to optimize duration prediction and introduces a hybrid, teacher-guided sampling method, improving quality, diversity, and efficiency in metric-optimized TTS systems.

Contribution

The paper extends metric optimization to the duration predictor using reinforcement learning and introduces teacher-guided sampling for better diversity and efficiency.

Findings

01. Superior performance across all metrics compared to previous systems.

02. Reduces sampling steps by half without quality loss.

03. Effective optimization of the duration prediction component.

Abstract

Diffusion-based text-to-speech (TTS) systems have made remarkable progress in zero-shot speech synthesis, yet optimizing all components for perceptual metrics remains challenging. Prior work with DMOSpeech demonstrated direct metric optimization for speech generation components, but duration prediction remained unoptimized. This paper presents DMOSpeech 2, which extends metric optimization to the duration predictor through a reinforcement learning approach. The proposed system implements a novel duration policy framework using group relative preference optimization (GRPO) with speaker similarity and word error rate as reward signals. By optimizing this previously unoptimized component, DMOSpeech 2 creates a more complete metric-optimized synthesis pipeline. Additionally, this paper introduces teacher-guided sampling, a hybrid approach leveraging a teacher model for initial denoising…
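
As a rough illustration of the group-relative idea behind GRPO with speaker similarity and word error rate as reward signals, the sketch below scores several candidate durations for one utterance and standardizes the rewards within that group. It is only a sketch under stated assumptions, not the paper's implementation: the function names, the reward formula (similarity minus weighted WER), the `wer_weight` parameter, and all numbers are hypothetical, and the mean/standard-deviation normalization is the form commonly used with GRPO rather than something confirmed by this excerpt.

```python
import math
from typing import List, Sequence


def combined_reward(speaker_sim: float, wer: float, wer_weight: float = 1.0) -> float:
    """Hypothetical scalar reward: higher similarity is better, higher WER is worse.

    The linear combination and wer_weight are assumptions made for illustration.
    """
    return speaker_sim - wer_weight * wer


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Standardize rewards within one group of samples drawn for the same input.

    Each sample's advantage is its reward minus the group mean, divided by the
    group standard deviation (plus eps to avoid division by zero).
    """
    n = len(rewards)
    if n == 0:
        return []
    mean = sum(rewards) / n
    variance = sum((r - mean) ** 2 for r in rewards) / n
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]


# Made-up measurements for four candidate durations sampled for one utterance.
candidates = [
    {"duration_s": 3.2, "speaker_sim": 0.62, "wer": 0.10},
    {"duration_s": 4.1, "speaker_sim": 0.70, "wer": 0.03},
    {"duration_s": 4.8, "speaker_sim": 0.68, "wer": 0.05},
    {"duration_s": 6.5, "speaker_sim": 0.55, "wer": 0.20},
]

rewards = [combined_reward(c["speaker_sim"], c["wer"]) for c in candidates]
advantages = group_relative_advantages(rewards)

for cand, adv in zip(candidates, advantages):
    print(f"duration={cand['duration_s']:.1f}s advantage={adv:+.3f}")
```

In a policy-gradient update, durations with positive advantage would be made more likely and those with negative advantage less likely. Because the advantages are computed relative to the group, no separate value network is needed to provide a baseline.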
