Reinforce-Aligner: Reinforcement Alignment Search for Robust End-to-End Text-to-Speech
Hyunseung Chung, Sang-Hoon Lee, Seong-Whan Lee

TL;DR
Reinforce-Aligner introduces a reinforcement learning-based internal aligner for end-to-end TTS, improving speech naturalness and fidelity by optimizing phoneme-to-frame duration predictions within a single model.
Contribution
The paper presents a novel reinforcement learning approach for internal duration alignment in end-to-end TTS, eliminating the need for external aligners and enhancing synthesis quality.
Findings
Achieves more accurate phoneme-to-frame alignments.
Outperforms state-of-the-art TTS models in naturalness.
Enhances speech fidelity through reinforcement-based duration search.
Abstract
Text-to-speech (TTS) synthesis is the process of producing synthesized speech from text or phoneme input. Traditional TTS models contain multiple processing steps and require external aligners, which provide attention alignments of phoneme-to-frame sequences. As complexity increases and efficiency decreases with every additional step, there is growing demand in modern synthesis pipelines for end-to-end TTS with efficient internal aligners. In this work, we propose an end-to-end text-to-waveform network with a novel reinforcement learning-based duration search method. Our proposed generator is feed-forward, and the aligner trains the agent to make optimal duration predictions by receiving active feedback from actions taken to maximize cumulative reward. We demonstrate that accurate alignments of phoneme-to-frame sequences generated from trained agents enhance fidelity and naturalness of…
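The idea of a phoneme-to-frame alignment driven by duration predictions can be illustrated with a small sketch. This is not the authors' implementation: the function names, the boundary-shifting action set {-1, 0, +1}, and the toy reward (reduction in absolute duration error against a known reference) are all assumptions made for illustration. In a real system the reward would presumably come from a synthesis-quality signal rather than from reference durations.

```python
from typing import List, Sequence


def expand_by_duration(phonemes: Sequence[str], durations: Sequence[int]) -> List[str]:
    """Repeat each phoneme for its number of frames, giving a frame-level alignment."""
    if len(phonemes) != len(durations):
        raise ValueError("phonemes and durations must have the same length")
    frames: List[str] = []
    for phoneme, duration in zip(phonemes, durations):
        if duration < 0:
            raise ValueError("durations must be non-negative")
        frames.extend([phoneme] * duration)
    return frames


def duration_error(durations: Sequence[int], reference: Sequence[int]) -> int:
    """Toy stand-in for a synthesis loss (assumption): total absolute duration error."""
    return sum(abs(d - r) for d, r in zip(durations, reference))


def shift_boundary(durations: Sequence[int], index: int, action: int) -> List[int]:
    """Hypothetical action: move the boundary between phoneme `index` and
    `index + 1` by `action` frames, keeping the total frame count fixed.
    Moves that would leave either phoneme with fewer than one frame are rejected."""
    if action not in (-1, 0, 1):
        raise ValueError("action must be -1, 0, or 1")
    updated = list(durations)
    updated[index] += action
    updated[index + 1] -= action
    if updated[index] < 1 or updated[index + 1] < 1:
        return list(durations)
    return updated


def reward(old: Sequence[int], new: Sequence[int], reference: Sequence[int]) -> int:
    """Toy reward (assumption): how much the action reduced the error."""
    return duration_error(old, reference) - duration_error(new, reference)


if __name__ == "__main__":
    phonemes = ["HH", "AY"]
    current = [2, 3]
    target = [3, 2]  # made-up reference durations for the demo
    proposed = shift_boundary(current, index=0, action=1)
    print(expand_by_duration(phonemes, proposed))
    print(reward(current, proposed, target))
```

An agent trained to maximize the cumulative value of such a reward would learn to prefer boundary shifts that improve the alignment. Keeping the total frame count fixed in `shift_boundary` is a design choice of this sketch so that the expanded sequence always matches the target length.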
