Total-Duration-Aware Duration Modeling for Text-to-Speech Systems

Sefik Emre Eskimez; Xiaofei Wang; Manthan Thakker; Chung-Hsien Tsai; Canrun Li; Zhen Xiao; Hemin Yang; Zirun Zhu; Min Tang; Jinyu Li; Sheng Zhao; Naoyuki Kanda

arXiv:2406.04281 · eess.AS · June 7, 2024 · Interspeech

Open Access

TL;DR

This paper introduces a total-duration-aware duration model for TTS that predicts phoneme durations considering total speech duration, improving speech quality and speaker similarity across different speech rates.

Contribution

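The paper proposes a total-duration-aware (TDA) duration model, which takes the target total duration as an additional input, and a MaskGIT-based duration model that improves the diversity and quality of the predicted phoneme durations.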

Findings

1. Improved intelligibility and speaker similarity across speech rates
2. Higher quality and diversity in phoneme duration predictions
3. Enhanced control over total speech duration in TTS

Abstract

Accurate control of the total duration of generated speech by adjusting the speech rate is crucial for various text-to-speech (TTS) applications. However, the impact of adjusting the speech rate on speech quality, such as intelligibility and speaker characteristics, has been underexplored. In this work, we propose a novel total-duration-aware (TDA) duration model for TTS, where phoneme durations are predicted not only from the text input but also from an additional input of the total target duration. We also propose a MaskGIT-based duration model that enhances the diversity and quality of the predicted phoneme durations. Our results demonstrate that the proposed TDA duration models achieve better intelligibility and speaker similarity for various speech rate configurations compared to the baseline models. We also show that the proposed MaskGIT-based model can generate phoneme durations…
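For contrast with the TDA idea, here is a minimal sketch of a simple way to hit a target total duration: predict phoneme durations without knowing the target, then rescale them uniformly so they sum to it. This is not the paper's TDA model, and the abstract does not describe this procedure. The function name `rescale_durations`, the use of integer frame counts, and the largest-remainder rounding are all assumptions made for illustration.

```python
from typing import List


def rescale_durations(durations: List[int], total_frames: int) -> List[int]:
    """Uniformly rescale phoneme durations (in frames) to sum to total_frames.

    Hypothetical helper for illustration only; not taken from the paper.
    """
    if not durations or sum(durations) <= 0:
        raise ValueError("durations must be non-empty with a positive sum")
    if total_frames < 0:
        raise ValueError("total_frames must be non-negative")

    # Same scale factor for every phoneme: the speech rate changes uniformly.
    scale = total_frames / sum(durations)
    scaled = [d * scale for d in durations]

    # Round down first, then hand the leftover frames to the phonemes
    # with the largest fractional parts so the sum matches exactly.
    frames = [int(s) for s in scaled]
    leftover = total_frames - sum(frames)
    by_fraction = sorted(
        range(len(scaled)), key=lambda i: scaled[i] - frames[i], reverse=True
    )
    for i in by_fraction[:leftover]:
        frames[i] += 1
    return frames


if __name__ == "__main__":
    # Three phonemes predicted at 2, 4, and 6 frames, stretched to 24 frames.
    print(rescale_durations([2, 4, 6], 24))
```

Uniform rescaling stretches every phoneme by the same factor regardless of its identity or context. A model that receives the total duration as an input, as the abstract describes, can instead decide how to distribute the change across phonemes.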

Taxonomy

Topics: Speech Recognition and Synthesis