Total-Duration-Aware Duration Modeling for Text-to-Speech Systems
Sefik Emre Eskimez, Xiaofei Wang, Manthan Thakker, Chung-Hsien Tsai, Canrun Li, Zhen Xiao, Hemin Yang, Zirun Zhu, Min Tang, Jinyu Li, Sheng Zhao, Naoyuki Kanda

TL;DR
This paper introduces a total-duration-aware duration model for TTS that predicts phoneme durations considering total speech duration, improving speech quality and speaker similarity across different speech rates.
Contribution
The paper proposes a novel total-duration-aware (TDA) duration model and a MaskGIT-based duration model that enhances the diversity and quality of predicted phoneme durations in TTS systems.
Findings
Improved intelligibility and speaker similarity across speech rates
Higher quality and diversity in phoneme duration predictions
Enhanced control over total speech duration in TTS
Abstract
Accurate control of the total duration of generated speech by adjusting the speech rate is crucial for various text-to-speech (TTS) applications. However, the impact of adjusting the speech rate on speech quality, such as intelligibility and speaker characteristics, has been underexplored. In this work, we propose a novel total-duration-aware (TDA) duration model for TTS, where phoneme durations are predicted not only from the text input but also from an additional input of the total target duration. We also propose a MaskGIT-based duration model that enhances the diversity and quality of the predicted phoneme durations. Our results demonstrate that the proposed TDA duration models achieve better intelligibility and speaker similarity for various speech rate configurations compared to the baseline models. We also show that the proposed MaskGIT-based model can generate phoneme durations…
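To make the total-duration constraint concrete, here is a minimal sketch of what it means for per-phoneme durations to sum to a target total. It is not the paper's TDA model: it assumes durations are integer frame counts and simply stretches an existing duration sequence by one uniform factor. The function name `rescale_durations` and the frame-count representation are hypothetical, chosen only for illustration.

```python
def rescale_durations(durations, target_total):
    """Uniformly rescale integer frame durations so they sum to target_total.

    Illustrative only: a naive uniform stretch, not the TDA model
    described in the abstract.
    """
    if target_total < 0:
        raise ValueError("target_total must be non-negative")
    current_total = sum(durations)
    if current_total <= 0:
        raise ValueError("durations must sum to a positive number")

    scale = target_total / current_total
    scaled = [d * scale for d in durations]

    # Round down first, then hand the leftover frames to the phonemes
    # with the largest fractional parts so the sum matches exactly.
    result = [int(s) for s in scaled]
    leftover = target_total - sum(result)
    order = sorted(
        range(len(scaled)),
        key=lambda i: scaled[i] - result[i],
        reverse=True,
    )
    for i in order[:leftover]:
        result[i] += 1
    return result


if __name__ == "__main__":
    # Hypothetical frame counts for three phonemes, stretched to 15 frames.
    stretched = rescale_durations([3, 5, 2], 15)
    print(stretched, sum(stretched))
```

A uniform stretch like this treats every phoneme identically. A model that takes the total target duration as an additional input, as the abstract proposes, can instead decide how that total is distributed across phonemes.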
Taxonomy
Topics: Speech Recognition and Synthesis
