Adaptive Duration Model for Text Speech Alignment
Junjie Cao

TL;DR
This paper introduces an adaptive duration prediction model for text-to-speech alignment that improves phoneme-level accuracy and robustness, especially in zero-shot TTS scenarios, by better capturing duration distributions.
Contribution
The paper presents a novel duration prediction framework that enhances phoneme-level alignment accuracy and adapts better to varying conditions in neural TTS models.
Findings
More precise phoneme-level duration predictions.
Improved alignment accuracy over baseline models.
Enhanced robustness of zero-shot TTS to mismatch between prompt audio and input audio.
Abstract
Speech-to-text alignment is a critical component of neural text-to-speech (TTS) models. Autoregressive TTS models typically use an attention mechanism to learn these alignments online, while non-autoregressive end-to-end TTS models rely on durations extracted from external sources. In this paper, we propose a novel duration prediction framework that predicts a phoneme-level duration distribution from the given text. In our experiments, the proposed duration model predicts more precisely and adapts better to conditions than previous baseline models. Specifically, it considerably improves phoneme-level alignment accuracy and makes zero-shot TTS models more robust to the mismatch between prompt audio and input audio.
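To make the role of durations concrete, the sketch below shows the generic idea of how per-phoneme durations define a text-to-speech alignment in non-autoregressive models, and one simple way alignment accuracy could be scored. It is not the paper's model or its evaluation metric: the function names (`expand_by_duration`, `boundaries`, `mean_boundary_error`), the frame-based units, and the boundary-error measure are all assumptions chosen for illustration.

```python
from itertools import accumulate


def expand_by_duration(phonemes, durations):
    """Repeat each phoneme for its duration (in frames) to get a frame-level alignment."""
    if len(phonemes) != len(durations):
        raise ValueError("phonemes and durations must have the same length")
    frames = []
    for phoneme, n_frames in zip(phonemes, durations):
        frames.extend([phoneme] * n_frames)
    return frames


def boundaries(durations):
    """End frame of each phoneme, i.e. the cumulative sum of the durations."""
    return list(accumulate(durations))


def mean_boundary_error(predicted, reference):
    """Mean absolute difference (in frames) between predicted and reference end frames.

    Hypothetical accuracy measure for illustration; the paper's metric may differ.
    """
    pred_b, ref_b = boundaries(predicted), boundaries(reference)
    if len(pred_b) != len(ref_b):
        raise ValueError("duration sequences must have the same length")
    return sum(abs(p - r) for p, r in zip(pred_b, ref_b)) / len(ref_b)


if __name__ == "__main__":
    phonemes = ["HH", "AY"]            # illustrative phoneme symbols
    predicted = [2, 3]                 # predicted durations in frames
    reference = [3, 2]                 # durations from an external aligner (assumed)
    print(expand_by_duration(phonemes, predicted))
    print(mean_boundary_error(predicted, reference))
```

A duration predictor replaces the hand-written `predicted` list here; the more closely its output matches the reference boundaries, the lower the boundary error.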
