Robust and Efficient Autoregressive Speech Synthesis with Dynamic Chunk-wise Prediction Policy
Bohan Li, Zhihan Li, Haoran Wang, Hanglei Zhang, Yiwei Guo, Hankun Wang, Xie Chen, Kai Yu

TL;DR
This paper introduces DCAR, a dynamic chunk-wise autoregressive framework for speech synthesis that improves both efficiency and robustness, producing faster and more intelligible speech than conventional next-token-prediction AR models.
Contribution
The paper proposes a dynamic chunk-wise prediction policy, driven by a lightweight module trained on-policy, that improves the stability and speed of AR speech synthesis.
Findings
Achieves up to a 72.27% improvement in speech intelligibility over conventional AR models.
Provides a 2.61x inference speedup.
Outperforms traditional AR models in quality and efficiency.
Abstract
Recently, autoregressive (AR) language models have emerged as a dominant approach in speech synthesis, offering expressive generation and scalable training. However, conventional AR speech synthesis models relying on the next-token prediction paradigm often encounter significant challenges when handling long speech sequences. These models often struggle to construct stable frame-to-frame attention, leading to increased latency and degraded synthesis quality, thereby limiting their feasibility for real-time applications. To address these limitations, we introduce a novel dynamic chunk-wise autoregressive synthesis framework, termed DCAR, designed to enhance both efficiency and intelligibility robustness in AR speech generation. DCAR introduces a chunk-to-frame attention mechanism through training with multi-token prediction, enabling dynamic chunk prediction in variable speech contexts…
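The abstract is truncated here, so the details of DCAR's policy and model are not available on this page. As a rough illustration of the general idea only (predicting a variable-size chunk of tokens per decoder call instead of one token), the toy sketch below counts how many model calls a chunk-wise loop needs compared with next-token decoding. The names `chunkwise_decode`, `toy_predict_chunk`, `toy_policy`, and `next_token_policy` are hypothetical stand-ins, not the authors' code or API, and the "model" simply counts upward.

```python
from typing import Callable, List, Tuple

EOS = 12  # hypothetical end-of-sequence token id for this toy example


def chunkwise_decode(
    predict_chunk: Callable[[List[int], int], List[int]],
    choose_chunk_size: Callable[[List[int]], int],
    prompt: List[int],
    max_len: int,
) -> Tuple[List[int], int]:
    """Generate tokens chunk by chunk; return (generated tokens, model calls)."""
    tokens = list(prompt)
    calls = 0
    while len(tokens) - len(prompt) < max_len:
        # Ask the (assumed) policy for a chunk size, clamped to the remaining budget.
        remaining = max_len - (len(tokens) - len(prompt))
        k = min(max(1, choose_chunk_size(tokens)), remaining)
        chunk = predict_chunk(tokens, k)  # one model call yields k tokens
        calls += 1
        for t in chunk:
            tokens.append(t)
            if t == EOS:
                return tokens[len(prompt):], calls
    return tokens[len(prompt):], calls


def toy_predict_chunk(context: List[int], k: int) -> List[int]:
    # Stand-in for a multi-token-prediction model: continues counting upward.
    start = context[-1] + 1
    return [min(start + i, EOS) for i in range(k)]


def toy_policy(context: List[int]) -> int:
    # Assumed heuristic: single tokens while context is short, chunks of 3 afterwards.
    return 1 if len(context) < 4 else 3


def next_token_policy(context: List[int]) -> int:
    # Baseline: conventional next-token prediction (chunk size 1).
    return 1


chunk_tokens, chunk_calls = chunkwise_decode(toy_predict_chunk, toy_policy, [0], 50)
base_tokens, base_calls = chunkwise_decode(toy_predict_chunk, next_token_policy, [0], 50)
print(chunk_calls, base_calls)
```

In this toy setting both loops emit the same 12 tokens, but the chunk-wise loop makes fewer model calls. The paper's reported speedup comes from its own model and learned policy, which this sketch does not reproduce.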
Taxonomy
Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques
