Incremental Text to Speech for Neural Sequence-to-Sequence Models using Reinforcement Learning
Devang S Ram Mohan, Raphael Lenain, Lorenzo Foglianti, Tian Huey Teh, Marlene Staib, Alexandra Torresquintero, Jiameng Gao

TL;DR
This paper introduces a reinforcement learning framework that enables neural sequence-to-sequence text-to-speech models to generate audio incrementally, reducing latency for time-sensitive applications like simultaneous interpretation.
Contribution
It presents a novel reinforcement learning approach to interleave reading and synthesis actions, allowing neural TTS models to operate incrementally rather than waiting for full input sequences.
Findings
The agent effectively balances latency and audio quality.
Incremental neural TTS models can be trained with reinforcement learning.
Performance surpasses rule-based systems in latency-quality trade-offs.
Abstract
Modern approaches to text to speech require the entire input character sequence to be processed before any audio is synthesised. This latency limits the suitability of such models for time-sensitive tasks like simultaneous interpretation. Interleaving the action of reading a character with that of synthesising audio reduces this latency. However, the order of this sequence of interleaved actions varies across sentences, which raises the question of how the actions should be chosen. We propose a reinforcement learning based framework to train an agent to make this decision. We compare our performance against that of deterministic, rule-based systems. Our results demonstrate that our agent successfully balances the trade-off between the latency of audio generation and the quality of synthesised audio. More broadly, we show that neural sequence-to-sequence models can be adapted to run in…
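To make the idea of interleaving reading and synthesis concrete, here is a minimal sketch of a deterministic, rule-based schedule of the kind a learned agent could be compared against. It is an illustration under assumptions, not the paper's method: the names `READ`, `SPEAK`, `rule_based_schedule`, and `chars_before_first_audio` are hypothetical, the "read k characters, then alternate" rule is one possible fixed policy, and the number of output frames is passed in for simplicity even though a real synthesiser does not know it in advance.

```python
from typing import List

# Hypothetical action labels for this sketch.
READ = "READ"    # consume one more input character
SPEAK = "SPEAK"  # synthesise one more unit of audio


def rule_based_schedule(num_chars: int, num_frames: int, k: int) -> List[str]:
    """Read k characters up front, then alternate SPEAK and READ.

    Once every character has been read, only SPEAK actions remain.
    Setting k >= num_chars reproduces the non-incremental case, where
    the whole input is read before any audio is produced.
    """
    actions: List[str] = []
    chars_read = 0
    frames_out = 0

    # Initial look-ahead: read up to k characters before speaking.
    while chars_read < min(k, num_chars):
        actions.append(READ)
        chars_read += 1

    # Interleave: emit one frame, then read one character if any remain.
    while frames_out < num_frames:
        actions.append(SPEAK)
        frames_out += 1
        if chars_read < num_chars:
            actions.append(READ)
            chars_read += 1

    # If audio finished early, consume any leftover input.
    while chars_read < num_chars:
        actions.append(READ)
        chars_read += 1

    return actions


def chars_before_first_audio(actions: List[str]) -> int:
    """Crude latency proxy: characters read before the first SPEAK."""
    return actions.index(SPEAK) if SPEAK in actions else len(actions)


incremental = rule_based_schedule(num_chars=3, num_frames=4, k=1)
full_input = rule_based_schedule(num_chars=3, num_frames=4, k=3)
```

A fixed rule like this applies the same ordering to every sentence. The abstract's point is that the best ordering varies across sentences, which is why the decision is handed to a trained agent instead.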
