CTC-TTS: LLM-based dual-streaming text-to-speech with CTC alignment

Hanwen Liu; Saierdaer Yusuyin; Hao Huang; Zhijian Ou

arXiv:2602.19574·eess.AS·February 24, 2026

Open Access

TL;DR

This paper introduces CTC-TTS, a dual-streaming text-to-speech system built on a CTC-based aligner and bi-word interleaving. It reports better synthesis quality than MFA-based and fixed-ratio baselines, with a dedicated lower-latency variant.

Contribution

It replaces traditional MFA forced alignment with a CTC-based aligner, adds a bi-word interleaving strategy for text and speech tokens, and proposes two variants, CTC-TTS-L and CTC-TTS-F, that trade synthesis quality against latency.
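For readers unfamiliar with CTC-based alignment, the following is a minimal sketch of the standard Viterbi pass over a CTC trellis, which assigns each frame either a target token or a blank. It is not the paper's aligner, whose details are not given in this summary; the function name `ctc_forced_align`, the toy posteriors, and the choice of blank id 0 are all assumptions made for illustration.

```python
import math

def ctc_forced_align(log_probs, targets, blank=0):
    """Most likely frame-level label path that collapses to `targets`.

    log_probs: list of T frames, each a list of per-label log-probabilities.
    targets:   list of label ids (no blanks).
    Hypothetical helper for illustration; not taken from the paper.
    """
    T = len(log_probs)
    # Extended sequence: blank, y1, blank, y2, ..., blank
    ext = [blank]
    for tok in targets:
        ext += [tok, blank]
    S = len(ext)
    NEG = -math.inf
    score = [[NEG] * S for _ in range(T)]
    back = [[0] * S for _ in range(T)]
    score[0][0] = log_probs[0][ext[0]]
    if S > 1:
        score[0][1] = log_probs[0][ext[1]]
    for t in range(1, T):
        for s in range(S):
            cands = [(score[t - 1][s], s)]            # stay in the same state
            if s >= 1:
                cands.append((score[t - 1][s - 1], s - 1))  # advance by one
            # Skip over a blank only between two different non-blank labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append((score[t - 1][s - 2], s - 2))
            best, prev = max(cands)
            if best > NEG:
                score[t][s] = best + log_probs[t][ext[s]]
                back[t][s] = prev
    # A valid path ends in the final blank or the final target label.
    s = S - 1
    if S > 1 and score[T - 1][S - 2] > score[T - 1][S - 1]:
        s = S - 2
    if score[T - 1][s] == NEG:
        raise ValueError("too few frames to align the target sequence")
    path = []
    for t in range(T - 1, -1, -1):
        path.append(ext[s])
        s = back[t][s]
    path.reverse()
    return path

# Toy posteriors over labels {0: blank, 1: "a", 2: "b"} for four frames.
frames = [[0.1, 0.8, 0.1], [0.8, 0.1, 0.1], [0.1, 0.1, 0.8], [0.1, 0.1, 0.8]]
frames = [[math.log(p) for p in row] for row in frames]
path = ctc_forced_align(frames, [1, 2])
print(path)
```

The returned path gives one label per frame, so token boundaries can be read off wherever the label changes. Unlike a GMM-HMM pipeline, the posteriors here would come from a single neural network.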

Findings

01. Outperforms MFA-based and fixed-ratio methods in streaming synthesis

02. Achieves higher quality in zero-shot TTS tasks

03. Demonstrates effective low-latency synthesis with CTC-TTS-F

Abstract

Large-language-model (LLM)-based text-to-speech (TTS) systems can generate natural speech, but most are not designed for low-latency dual-streaming synthesis. High-quality dual-streaming TTS depends on accurate text-speech alignment and well-designed training sequences that balance synthesis quality and latency. Prior work often relies on GMM-HMM-based forced-alignment toolkits (e.g., MFA), which are pipeline-heavy and less flexible than neural aligners; fixed-ratio interleaving of text and speech tokens struggles to capture text-speech alignment regularities. We propose CTC-TTS, which replaces MFA with a CTC-based aligner and introduces a bi-word-based interleaving strategy. Two variants are designed: CTC-TTS-L (token concatenation along the sequence length) for higher quality and CTC-TTS-F (embedding stacking along the feature dimension) for lower latency. Experiments show that…
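The difference between the two variants can be pictured as a difference in tensor shape. The sketch below is a toy illustration only: the helper names, the toy `embed` function, the padding scheme, and the use of vector concatenation to "stack" features are all assumptions, since the abstract does not say how the paper pairs, pads, or combines text and speech embeddings.

```python
def embed(token_id, dim=4):
    """Hypothetical toy embedding: a deterministic vector per token id."""
    return [float(token_id + i) for i in range(dim)]

def concat_along_length(text_ids, speech_ids, dim=4):
    """Length-wise layout: text and speech occupy separate sequence positions."""
    return [embed(t, dim) for t in text_ids] + [embed(s, dim) for s in speech_ids]

def stack_along_features(text_ids, speech_ids, pad_id=0, dim=4):
    """Feature-wise layout: one position carries a text and a speech embedding.

    The shorter stream is padded with `pad_id` (an assumption), and the two
    embeddings are concatenated into a single vector of size 2 * dim.
    """
    n = max(len(text_ids), len(speech_ids))
    text = list(text_ids) + [pad_id] * (n - len(text_ids))
    speech = list(speech_ids) + [pad_id] * (n - len(speech_ids))
    return [embed(t, dim) + embed(s, dim) for t, s in zip(text, speech)]

text_ids = [5, 6]
speech_ids = [10, 11, 12, 13]

seq_l = concat_along_length(text_ids, speech_ids)
seq_f = stack_along_features(text_ids, speech_ids)

print(len(seq_l), len(seq_l[0]))  # positions, feature size for the length-wise layout
print(len(seq_f), len(seq_f[0]))  # positions, feature size for the feature-wise layout
```

In this toy setup the feature-wise layout uses fewer sequence positions at a larger feature size, which is one plausible reading of why CTC-TTS-F is described as the lower-latency variant.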

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and Audio Processing