Text-To-Speech with Chain-of-Details: modeling temporal dynamics in speech generation

Jianbo Ma; Richard Cartwright

arXiv:2604.19330·eess.AS·April 30, 2026


TL;DR

This paper introduces Chain-of-Details, a novel cascaded framework for TTS that models temporal dynamics at multiple granularities, improving naturalness with fewer parameters.

Contribution

The paper proposes a new coarse-to-fine temporal modeling approach in TTS, enabling efficient and natural speech synthesis without explicit phoneme duration predictors.

Findings

01. Chain-of-Details (CoD) achieves competitive performance with fewer parameters.

02. Explicit temporal modeling enhances speech naturalness.

03. The lowest detail level inherently performs phonetic planning, without an explicit phoneme duration predictor.

Abstract

Recent advances in Text-To-Speech (TTS) synthesis have popularized multi-stage approaches that first predict semantic tokens and then generate acoustic tokens. In this paper, we extend the coarse-to-fine generation paradigm to the temporal domain and introduce Chain-of-Details (CoD), a novel framework that explicitly models temporal coarse-to-fine dynamics in speech generation using a cascaded architecture. Our method progressively refines temporal details across multiple stages, with each stage targeting a specific temporal granularity. All temporal detail predictions are performed using a shared decoder, enabling efficient parameter utilization across different temporal resolutions. Notably, we observe that the lowest detail level naturally performs phonetic planning without the need for an explicit phoneme duration predictor. We evaluate our method on several datasets and…
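The cascade described in the abstract can be pictured with a toy sketch. This is not the authors' implementation: the downsampling factors, the nearest-neighbour upsampling, the scalar "frames", and the names `temporal_lengths`, `upsample_repeat`, `chain_of_details`, and `toy_decoder` are all assumptions made for illustration. The only properties taken from the abstract are that generation proceeds from coarse to fine temporal resolution and that a single decoder is reused at every stage.

```python
from typing import Callable, List, Sequence

# Hypothetical decoder signature: (coarse frames, text condition, level) -> refined frames.
Decoder = Callable[[List[float], float, int], List[float]]


def temporal_lengths(full_len: int, factors: Sequence[int]) -> List[int]:
    """Number of frames at each stage, given assumed downsampling factors
    ordered from coarsest to finest (ceiling division)."""
    return [-(-full_len // f) for f in factors]


def upsample_repeat(frames: List[float], target_len: int) -> List[float]:
    """Nearest-neighbour upsampling: stretch the frames to target_len by repetition."""
    n = len(frames)
    return [frames[i * n // target_len] for i in range(target_len)]


def chain_of_details(text_cond: float, full_len: int, factors: Sequence[int],
                     shared_decoder: Decoder) -> List[List[float]]:
    """Run a coarse-to-fine cascade, calling the same decoder at every stage."""
    lengths = temporal_lengths(full_len, factors)
    current = [0.0] * lengths[0]  # blank canvas at the coarsest resolution
    outputs = []
    for level, length in enumerate(lengths):
        coarse = upsample_repeat(current, length)             # carry the previous stage forward
        current = shared_decoder(coarse, text_cond, level)    # refine at this granularity
        outputs.append(current)
    return outputs


def toy_decoder(coarse: List[float], text_cond: float, level: int) -> List[float]:
    """Stand-in for a learned decoder: adds a smaller correction at finer levels."""
    return [c + text_cond / (level + 1) for c in coarse]


stages = chain_of_details(text_cond=1.0, full_len=16, factors=[4, 2, 1],
                          shared_decoder=toy_decoder)
stage_lengths = [len(s) for s in stages]
```

In this sketch each stage doubles the temporal resolution, so `stage_lengths` grows from the coarsest to the full frame count. Passing one `shared_decoder` to every stage mirrors the parameter sharing mentioned in the abstract; a real system would condition a learned model on the level rather than use a fixed arithmetic rule.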
