RephraseTTS: Dynamic Length Text based Speech Insertion with Speaker Style Transfer
Neeraj Matiyali, Siddharth Srivastava, Gaurav Sharma

TL;DR
RephraseTTS introduces a transformer-based, non-autoregressive method for text-conditioned speech insertion that dynamically determines speech length, preserves speaker style, and outperforms existing adaptive TTS baselines.
Contribution
The paper presents RephraseTTS as the first method to enable variable-length speech insertion conditioned on text and partial speech while maintaining the speaker's style and prosody across the insertion.
Findings
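Outperforms baselines built on an existing adaptive TTS method in experiments on LibriTTS
Determines the length of the inserted speech dynamically at inference time, from the text transcript and the tempo of the available partial input
Produces high-quality speech insertions, as confirmed by a user study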
Abstract
We propose a method for the task of text-conditioned speech insertion, i.e., inserting a speech sample into an input speech sample, conditioned on the corresponding complete text transcript. An example use case of the task would be to update the speech audio when corrections are made to the corresponding text transcript. The proposed method follows a transformer-based non-autoregressive approach that allows speech insertions of variable lengths, which are dynamically determined during inference, based on the text transcript and the tempo of the available partial input. It is capable of maintaining the speaker's voice characteristics, prosody, and other spectral properties of the available speech input. Results from our experiments and user study on LibriTTS show that our method outperforms baselines based on an existing adaptive text-to-speech method. We also provide numerous qualitative…
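To make the idea of tempo-based length determination concrete, here is a minimal sketch of one simple heuristic. It is not the method from the paper, whose length predictor is learned. It only illustrates how the speaking rate of the available speech could set the duration of an insertion. The function name `estimate_insertion_frames`, the use of alphanumeric characters as the timing unit, and the frame counts are all assumptions made for illustration.

```python
def estimate_insertion_frames(known_text: str, known_frames: int, inserted_text: str) -> int:
    """Estimate how many spectrogram frames an inserted phrase should occupy.

    Illustrative heuristic only (an assumption, not the paper's learned model):
    measure the tempo of the available speech as frames per alphanumeric
    character, then scale by the length of the text to be inserted.
    """
    # Count alphanumeric characters as a crude proxy for spoken units.
    known_units = sum(1 for ch in known_text if ch.isalnum())
    if known_units == 0:
        raise ValueError("known_text must contain at least one alphanumeric character")

    # Tempo of the available partial input: frames spent per spoken unit.
    frames_per_unit = known_frames / known_units

    # Apply the same tempo to the text that has no audio yet.
    inserted_units = sum(1 for ch in inserted_text if ch.isalnum())
    return round(frames_per_unit * inserted_units)
```

For example, if the available speech for "hello world" spans 100 frames, the heuristic allots 10 frames per character, so inserting "new" would be assigned 30 frames. A faster speaker (fewer frames for the same text) would yield a proportionally shorter insertion.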
