SemaVoice: Semantic-Aware Continuous Autoregressive Speech Synthesis

Huimeng Wang; Hui Lu; Jiajun Deng; Haoning Xu; Youjun Chen; Xueyuan Chen; Zhaoqing Li; Shuhai Peng; Shiyin Kang; Xunying Liu

arXiv:2605.16964·eess.AS·May 19, 2026

TL;DR

SemaVoice is a semantic-aware continuous autoregressive TTS framework that improves zero-shot speech synthesis by aligning continuous speech representations with semantic and structural cues from a Speech Foundation Model (SFM).

Contribution

It introduces a Speech Foundation Model (SFM) guided alignment mechanism that refines continuous speech representations for high-fidelity zero-shot TTS.

Findings

01. Achieves 1.71% WER on the Seed-TTS benchmark.

02. Outperforms existing open-source systems in objective and subjective evaluations.

03. Shows significant improvements with SFM guided alignment across different granularities.
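
For readers unfamiliar with the WER figure quoted in the findings, the sketch below shows how word error rate is conventionally defined: the word-level edit distance (substitutions, deletions, insertions) between a reference text and a transcript, divided by the number of reference words. In TTS evaluation the transcript typically comes from running an ASR model on the synthesized audio. This is an illustration of the standard metric only, not the paper's evaluation pipeline; the function name `word_error_rate` is hypothetical, and text normalization (casing, punctuation) is assumed to have been done beforehand.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumes both strings are already normalized and whitespace-tokenizable.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
print(f"{wer:.4f}")  # 0.1667
```

A WER of 1.71% therefore corresponds to fewer than two word errors per hundred reference words, aggregated over the test set.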

Abstract

Continuous autoregressive speech synthesis has recently emerged as a promising direction for zero-shot text-to-speech (TTS). However, existing methods still suffer from a fundamental mismatch between semantic-prosodic modeling and reconstruction-driven continuous speech representations. This mismatch causes TTS models to focus excessively on low-level acoustic textures at the expense of high-level semantic coherence, further exacerbating error accumulation in autoregressive generation. To address this challenge, we propose SemaVoice, a semantic-aware continuous autoregressive framework for high-fidelity zero-shot TTS. SemaVoice introduces a Speech Foundation Model (SFM) guided alignment mechanism that refines continuous speech representations to better capture both local semantic consistency and global structural relationships. These representations condition a patch-wise diffusion head…
