Speech Recognition Model Improves Text-to-Speech Synthesis using Fine-Grained Reward

Guansu Wang; Peijie Sun

arXiv:2511.17555·eess.AS·November 25, 2025

TL;DR

This paper introduces W3AR, a novel method that leverages ASR model attention to provide fine-grained, word-level rewards for improving text-to-speech synthesis quality and robustness, especially for unseen speakers.

Contribution

W3AR is a new approach that uses pre-trained ASR attention for sequence alignment and optimization in TTS, eliminating the need for explicit reward annotations.

Findings

01 W3AR enhances TTS speech quality.

02 W3AR improves zero-shot speaker robustness.

03 Fine-grained rewards lead to better alignment and synthesis.

Abstract

Recent advances in text-to-speech (TTS) have enabled models to clone arbitrary unseen speakers and synthesize high-quality, natural-sounding speech. However, evaluation methods lag behind: typical mean opinion score (MOS) estimators perform regression over entire utterances, while failures usually occur in a few problematic words. We observe that encoder-decoder ASR models (e.g., Whisper) surface word-level mismatches between speech and text via cross-attention, providing a fine-grained reward signal. Building on this, we introduce Word-level TTS Alignment by ASR-driven Attentive Reward (W3AR). Without explicit reward annotations, W3AR uses attention from a pre-trained ASR model to drive finer-grained alignment and optimization of sequences predicted by a TTS model. Experiments show that W3AR improves the quality of existing TTS systems and strengthens zero-shot robustness on unseen…
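
To make the idea of a word-level signal from cross-attention concrete, here is a minimal sketch. It is not the paper's formulation: this excerpt does not say how W3AR converts attention into a reward, so the scoring rule (one minus normalized attention entropy, averaged over the tokens of each word) and the names `word_level_rewards`, `attn`, and `word_ids` are illustrative assumptions. The sketch assumes a cross-attention matrix with one row per text token and one column per audio frame has already been extracted from an encoder-decoder ASR model.

```python
import numpy as np

def word_level_rewards(attn, word_ids):
    """Toy per-word score from an ASR cross-attention matrix.

    attn:     (num_tokens, num_frames) non-negative weights, each row summing to 1.
    word_ids: length num_tokens; word_ids[i] is the index of the word token i belongs to.

    Hypothetical scoring rule (an assumption, not taken from the paper):
    sharply peaked attention scores near 1, diffuse attention scores near 0.
    """
    attn = np.asarray(attn, dtype=float)
    word_ids = np.asarray(word_ids)
    num_frames = attn.shape[1]  # must be greater than 1

    # Shannon entropy of each token's attention distribution over audio frames.
    eps = 1e-12
    entropy = -(attn * np.log(attn + eps)).sum(axis=1)

    # Normalize by the maximum possible entropy so the focus score lies in [0, 1].
    focus = 1.0 - entropy / np.log(num_frames)

    # Average token-level focus within each word.
    num_words = int(word_ids.max()) + 1
    return np.array([focus[word_ids == w].mean() for w in range(num_words)])

# Two tokens of word 0 attend sharply; the single token of word 1 attends uniformly.
attn = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.25, 0.25, 0.25, 0.25],
]
rewards = word_level_rewards(attn, word_ids=[0, 0, 1])
print(rewards)  # word 0 scores close to 1, word 1 close to 0
```

In this toy setup, a word whose tokens attend diffusely across the audio is flagged with a low score, which is one plausible way a per-word signal could localize the "few problematic words" that an utterance-level MOS regressor averages away.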

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Topic Modeling