Robust and Efficient Autoregressive Speech Synthesis with Dynamic Chunk-wise Prediction Policy

Bohan Li; Zhihan Li; Haoran Wang; Hanglei Zhang; Yiwei Guo; Hankun Wang; Xie Chen; Kai Yu

arXiv:2506.22023·cs.SD·June 30, 2025

Open Access

TL;DR

This paper introduces DCAR, a dynamic chunk-wise autoregressive framework for speech synthesis that improves efficiency and robustness, generating speech faster and more intelligibly than conventional next-token-prediction AR models.

Contribution

The paper proposes a novel dynamic chunk-wise prediction policy with a lightweight on-policy training module, enhancing AR speech synthesis stability and speed.

Findings

01. Achieves up to a 72.27% improvement in speech intelligibility.

02. Speeds up inference by 2.61x.

03. Outperforms traditional AR models in both quality and efficiency.

Abstract

Recently, autoregressive (AR) language models have emerged as a dominant approach in speech synthesis, offering expressive generation and scalable training. However, conventional AR speech synthesis models relying on the next-token prediction paradigm often encounter significant challenges when handling long speech sequences. These models often struggle to construct stable frame-to-frame attention, leading to increased latency and degraded synthesis quality, thereby limiting their feasibility for real-time applications. To address these limitations, we introduce a novel dynamic chunk-wise autoregressive synthesis framework, termed DCAR, designed to enhance both efficiency and intelligibility robustness in AR speech generation. DCAR introduces a chunk-to-frame attention mechanism through training with multi-token prediction, enabling dynamic chunk prediction in variable speech contexts…
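
The difference between next-token and chunk-wise decoding can be illustrated with a toy loop. The following is a minimal sketch, not DCAR's implementation: `decode_chunkwise`, `toy_step`, `next_token_policy`, and `toy_dynamic_policy` are hypothetical names, the "model" just emits frame indices, and the hand-written chunk-size rule stands in for the learned policy the paper describes. It only shows why emitting several frames per forward pass reduces the number of passes.

```python
from typing import Callable, List, Tuple


def decode_chunkwise(
    step_fn: Callable[[List[int], int], List[int]],
    policy_fn: Callable[[List[int]], int],
    total_frames: int,
    max_chunk: int,
) -> Tuple[List[int], int]:
    """Generate `total_frames` tokens, emitting a variable-size chunk per pass.

    step_fn(context, k) returns k new tokens (one simulated forward pass).
    policy_fn(context) proposes a chunk size, clamped to [1, max_chunk].
    Returns the generated tokens and the number of forward passes used.
    """
    tokens: List[int] = []
    passes = 0
    while len(tokens) < total_frames:
        remaining = total_frames - len(tokens)
        k = max(1, min(policy_fn(tokens), max_chunk, remaining))
        tokens.extend(step_fn(tokens, k))
        passes += 1
    return tokens, passes


def toy_step(context: List[int], k: int) -> List[int]:
    # Stand-in for a model call: emit the next k frame indices.
    return [len(context) + i for i in range(k)]


def next_token_policy(context: List[int]) -> int:
    # Conventional AR decoding: one token per forward pass.
    return 1


def toy_dynamic_policy(context: List[int]) -> int:
    # Hypothetical rule: small chunks early, larger chunks once context exists.
    return 2 if len(context) < 4 else 4


if __name__ == "__main__":
    for name, policy in [("next-token", next_token_policy),
                         ("chunk-wise", toy_dynamic_policy)]:
        _, n_passes = decode_chunkwise(toy_step, policy, total_frames=12, max_chunk=4)
        print(f"{name}: {n_passes} forward passes")
```

In this toy setting both policies produce the same 12 frames, but the chunk-wise policy needs fewer simulated forward passes. Real speedups depend on the model, the learned policy, and hardware, so the pass count here should not be read as a reproduction of the reported 2.61x figure.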

Taxonomy

Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques