Adaptive Duration Model for Text Speech Alignment

Junjie Cao

arXiv:2507.22612·cs.SD·September 1, 2025

TL;DR

This paper introduces an adaptive duration prediction model for text-to-speech alignment that improves phoneme-level accuracy and robustness, especially in zero-shot TTS scenarios, by better capturing duration distributions.

Contribution

The paper presents a novel duration prediction framework that enhances phoneme-level alignment accuracy and adapts better to varying conditions in neural TTS models.

Findings

01. More precise phoneme-level duration predictions.

02. Improved alignment accuracy over baseline models.

03. Enhanced robustness of zero-shot TTS to audio mismatches.
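
The listing does not say how alignment accuracy is measured. As a rough illustration only, one common choice in forced-alignment work is the fraction of predicted phoneme boundaries that fall within a fixed tolerance of the reference boundaries. The function name `boundary_accuracy` and the 20 ms default tolerance below are assumptions made for this sketch, not details taken from the paper.

```python
from typing import Sequence


def boundary_accuracy(
    pred_ms: Sequence[float],
    ref_ms: Sequence[float],
    tol_ms: float = 20.0,  # assumed tolerance; the paper's value is not given here
) -> float:
    """Fraction of predicted boundaries within tol_ms of the reference.

    Hypothetical metric for illustration: both sequences hold boundary
    times in milliseconds, one entry per phoneme boundary, in the same order.
    """
    if len(pred_ms) != len(ref_ms):
        raise ValueError("pred_ms and ref_ms must have the same length")
    if not ref_ms:
        raise ValueError("at least one boundary is required")
    # Count boundaries whose absolute timing error is within the tolerance.
    hits = sum(1 for p, r in zip(pred_ms, ref_ms) if abs(p - r) <= tol_ms)
    return hits / len(ref_ms)


# Example: the second boundary is off by 50 ms, so 2 of 3 boundaries count.
score = boundary_accuracy([100.0, 250.0, 400.0], [110.0, 200.0, 405.0])
```

A tighter tolerance makes the metric stricter, so scores are only comparable across models when the same tolerance is used.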

Abstract

Speech-to-text alignment is a critical component of neural text-to-speech (TTS) models. Autoregressive TTS models typically use an attention mechanism to learn these alignments online, while non-autoregressive end-to-end TTS models rely on durations extracted from external sources. In this paper, we propose a novel duration prediction framework that predicts a phoneme-level duration distribution for a given text. In our experiments, the proposed duration model predicts more precisely and adapts better to conditions than previous baseline models. Specifically, it considerably improves phoneme-level alignment accuracy and makes zero-shot TTS models more robust to the mismatch between prompt audio and input audio.
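
To make the role of durations concrete, the sketch below shows how a non-autoregressive TTS pipeline typically consumes phoneme-level durations: each phoneme is repeated for its number of frames, which fixes the text-to-frame alignment without attention. This is the general length-regulation idea, not the duration model proposed in the paper. The helper names and the 10 ms frame shift are assumptions chosen for illustration.

```python
from typing import List, Sequence, Tuple


def expand_by_duration(phonemes: Sequence[str], durations: Sequence[int]) -> List[str]:
    """Repeat each phoneme for its duration in frames (length regulation)."""
    if len(phonemes) != len(durations):
        raise ValueError("phonemes and durations must have the same length")
    if any(d < 0 for d in durations):
        raise ValueError("durations must be non-negative frame counts")
    frames: List[str] = []
    for phoneme, n_frames in zip(phonemes, durations):
        frames.extend([phoneme] * n_frames)
    return frames


def durations_to_boundaries(
    durations: Sequence[int],
    frame_shift_ms: float = 10.0,  # assumed frame shift, not from the paper
) -> List[Tuple[float, float]]:
    """Convert per-phoneme frame counts to (start_ms, end_ms) intervals."""
    boundaries: List[Tuple[float, float]] = []
    end = 0
    for n_frames in durations:
        start = end
        end = start + n_frames
        boundaries.append((start * frame_shift_ms, end * frame_shift_ms))
    return boundaries


# Example: two phonemes lasting 2 and 3 frames.
frame_labels = expand_by_duration(["HH", "AH"], [2, 3])
intervals = durations_to_boundaries([2, 3])
```

Because the frame-level alignment is fully determined by the durations, any error in the predicted durations carries over directly to the alignment, which is why the accuracy of the duration model matters for these systems.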
