An Ultra-Low Latency, End-to-End Streaming Speech Synthesis Architecture via Block-Wise Generation and Depth-Wise Codec Decoding

Tianhui Su; Tien-Ping Tan; Salima Mdhaffar; Yannick Estève; Aghilas Sini

arXiv:2604.12438·eess.AS·April 15, 2026


TL;DR

This paper introduces an end-to-end speech synthesis architecture that achieves ultra-low latency and high fidelity through block-wise generation and depth-wise codec decoding, making it suitable for real-time applications.

Contribution

The paper proposes a non-autoregressive model that combines a modified FastSpeech 2 backbone with progressive depth-wise sequential decoding of Mimi codec tokens, enabling ultra-low latency streaming speech synthesis with improved quality.

Findings

01. Achieves 10.6-fold faster inference than traditional pipelines.

02. Attains an average latency of 48.99 ms, below the human perception threshold.

03. Demonstrates language-independent deployment on English and Malay datasets.

Abstract

Real-time speech synthesis requires balancing inference latency and acoustic fidelity for interactive applications. Conventional continuous text-to-speech pipelines require computationally intensive neural vocoders to reconstruct phase information, creating a significant streaming bottleneck. Furthermore, regression-based acoustic modeling frequently induces spectral over-smoothing artifacts. To address these limitations, this paper proposes a novel end-to-end non-autoregressive architecture optimized for ultra-low latency block-wise generation, directly modeling the highly compressed discrete latent space of the Mimi neural audio codec. Integrating a modified FastSpeech 2 backbone with a progressive depth-wise sequential decoding strategy, the architecture dynamically conditions 32 layers of residual vector quantization codes. This mechanism resolves phonetic alignment degradation and…
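The abstract refers to 32 layers of residual vector quantization (RVQ) codes. As a rough illustration of what such a code stack represents in general, the toy sketch below quantizes one vector with a stack of codebooks and reconstructs it from the first few levels only. It is not the paper's model and not Mimi's implementation: the codebook size, vector dimension, random codebooks, shrinking scale per level, and the zero entry in every codebook are all assumptions chosen to keep the example small and self-contained.

```python
import random

DEPTH = 32          # matches the 32 RVQ layers mentioned in the abstract
CODEBOOK_SIZE = 16  # hypothetical; chosen only to keep the toy small
DIM = 4             # hypothetical latent dimension


def make_codebooks(rng):
    """Build random toy codebooks whose entries shrink with depth (assumption)."""
    codebooks = []
    for level in range(DEPTH):
        scale = 0.8 ** level
        # A zero entry lets a level "do nothing", so the error never grows.
        entries = [[0.0] * DIM]
        entries += [
            [rng.uniform(-scale, scale) for _ in range(DIM)]
            for _ in range(CODEBOOK_SIZE - 1)
        ]
        codebooks.append(entries)
    return codebooks


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def rvq_encode(codebooks, x):
    """Quantize x level by level; each level encodes the previous residual."""
    residual = list(x)
    codes = []
    for cb in codebooks:
        k = min(range(len(cb)), key=lambda i: sq_dist(cb[i], residual))
        codes.append(k)
        residual = [r - c for r, c in zip(residual, cb[k])]
    return codes


def rvq_decode(codebooks, codes, depth):
    """Reconstruct from the first `depth` levels by summing the chosen entries."""
    out = [0.0] * DIM
    for cb, k in zip(codebooks[:depth], codes[:depth]):
        out = [o + c for o, c in zip(out, cb[k])]
    return out


rng = random.Random(0)
codebooks = make_codebooks(rng)
x = [rng.uniform(-1.0, 1.0) for _ in range(DIM)]
codes = rvq_encode(codebooks, x)

# Reconstruction error after using 0, 1, ..., DEPTH levels of codes.
errors = [sq_dist(x, rvq_decode(codebooks, codes, d)) for d in range(DEPTH + 1)]
```

In this toy setup each additional level refines the reconstruction of the same frame, which is the general property a depth-wise decoder can exploit by predicting coarse levels first and conditioning finer levels on them. How the paper's architecture actually conditions the 32 levels is not specified in the truncated abstract above.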
