SyncSpeech: Efficient and Low-Latency Text-to-Speech based on Temporal Masked Transformer

Zhengyan Sheng; Zhihao Du; Shiliang Zhang; Zhijie Yan; Liping Chen

arXiv:2502.11094·cs.SD·March 17, 2026

TL;DR

SyncSpeech is a TTS model built on the Temporal Mask Transformer (TMT) paradigm, which combines the temporally ordered generation of autoregressive (AR) models with the parallel decoding of non-autoregressive (NAR) models. It targets high efficiency and low latency while keeping speech quality comparable to AR TTS models, and it uses a high-probability masking strategy during training.

Contribution

The paper proposes SyncSpeech, a TTS model based on the Temporal Mask Transformer that unifies AR and NAR advantages with a new sequence construction, training objective, and hybrid attention mask.

Findings

01 Achieves a 5.8-fold reduction in first-packet latency.

02 Attains an 8.8-fold improvement in real-time factor.

03 Maintains speech quality comparable to AR TTS models.
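
For readers unfamiliar with the two efficiency metrics named above, the following is a minimal sketch of how a real-time factor and a fold change are commonly computed. It assumes the usual definition of real-time factor (synthesis time divided by the duration of the generated audio, so lower is better). The function names and every timing value are hypothetical placeholders for illustration, not measurements from the paper.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Synthesis time divided by generated audio duration (lower is better)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def fold_change(baseline: float, improved: float) -> float:
    """How many times smaller the improved value is than the baseline."""
    if improved <= 0:
        raise ValueError("improved must be positive")
    return baseline / improved


# Hypothetical timings for an 8-second utterance (not from the paper).
baseline_rtf = real_time_factor(synthesis_seconds=4.0, audio_seconds=8.0)
faster_rtf = real_time_factor(synthesis_seconds=0.5, audio_seconds=8.0)
rtf_gain = fold_change(baseline_rtf, faster_rtf)

# Hypothetical first-packet latencies in seconds (not from the paper).
latency_gain = fold_change(baseline=0.5, improved=0.125)

print(f"RTF: {baseline_rtf} -> {faster_rtf} ({rtf_gain}-fold improvement)")
print(f"First-packet latency: {latency_gain}-fold reduction")
```

Under this definition, an "N-fold" figure is a ratio against a baseline system, so it is only meaningful alongside the baseline it was measured against.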

Abstract

Current text-to-speech (TTS) models face a persistent limitation: autoregressive (AR) models suffer from low generation efficiency, while modern non-autoregressive (NAR) models experience high latency due to their unordered temporal nature. To bridge this divide, we introduce SyncSpeech, an efficient and low-latency TTS model based on the proposed Temporal Mask Transformer (TMT) paradigm. TMT synergistically unifies the temporally ordered generation of AR models with the parallel decoding efficiency of NAR models. TMT is realized through a meticulously designed sequence construction rule, a corresponding training objective, and a specialized hybrid attention mask. Furthermore, with the primary aim of enhancing training efficiency, a high-probability masking strategy is introduced, which also leads to a significant improvement in overall model performance. During inference, SyncSpeech…

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and dialogue systems