Parallel GPT: Harmonizing the Independence and Interdependence of Acoustic and Semantic Information for Zero-Shot Text-to-Speech

Jingyuan Xing; Zhipeng Li; Jialong Mai; Xiaofen Xing; Xiangmin Xu

arXiv:2508.04141 · eess.AS · August 29, 2025

TL;DR

Parallel GPT is a zero-shot TTS framework that combines autoregressive (AR) and non-autoregressive (NAR) modules to capture both the independent and the interdependent aspects of acoustic and semantic information, improving speech synthesis quality and efficiency.

Contribution

The paper proposes a parallel TTS architecture that harmonizes the independence and interdependence of acoustic and semantic features using AR and NAR modules, built around a Parallel Tokenizer and a Coupled NAR model.

Findings

01. Outperforms existing zero-shot TTS models in quality and efficiency

02. Demonstrates effectiveness on English and Chinese datasets

03. Provides high-quality speech synthesis with a parallel structure

Abstract

Advances in speech representation and large language models have enhanced zero-shot text-to-speech (TTS) performance. However, existing zero-shot TTS models struggle to capture the complex correlations between acoustic and semantic features, which leaves the synthesized speech lacking in expressiveness and speaker similarity. The primary reason is that the relationship between semantic and acoustic features has both independent and interdependent aspects. This paper introduces a TTS framework that combines autoregressive (AR) and non-autoregressive (NAR) modules to harmonize the independence and interdependence of acoustic and semantic information. The AR model leverages the proposed Parallel Tokenizer to synthesize the top semantic and acoustic tokens simultaneously. In contrast, considering the interdependence, the Coupled NAR model predicts detailed tokens based on the general AR…
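The two-stage flow described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`toy_ar_step`, `ar_generate`, `nar_refine`), the codebook sizes, the number of detail layers, and the arithmetic that stands in for the neural networks are all hypothetical. It only shows the shape of the idea: an AR loop emits one top-level (semantic, acoustic) token pair per step, and a NAR stage then fills in detail tokens for every frame at once, conditioned jointly on both top-level tokens.

```python
import random
from typing import List, Tuple

# Hypothetical sizes, chosen only for illustration.
SEMANTIC_VOCAB = 8
ACOUSTIC_VOCAB = 8
NUM_DETAIL_LAYERS = 3

TokenPair = Tuple[int, int]


def toy_ar_step(text_ids: List[int], history: List[TokenPair],
                rng: random.Random) -> TokenPair:
    """Stand-in for one AR decoding step.

    Emits the top semantic and top acoustic token together, conditioned on
    the text and on the previously generated pair (a real model would use a
    neural network here, not modular arithmetic).
    """
    prev_s, prev_a = history[-1] if history else (0, 0)
    semantic = (prev_s + len(text_ids) + rng.randrange(SEMANTIC_VOCAB)) % SEMANTIC_VOCAB
    acoustic = (prev_a + sum(text_ids) + rng.randrange(ACOUSTIC_VOCAB)) % ACOUSTIC_VOCAB
    return semantic, acoustic


def ar_generate(text_ids: List[int], num_frames: int, seed: int = 0) -> List[TokenPair]:
    """Autoregressive loop: each frame depends on the frames before it."""
    rng = random.Random(seed)
    history: List[TokenPair] = []
    for _ in range(num_frames):
        history.append(toy_ar_step(text_ids, history, rng))
    return history


def nar_refine(top_tokens: List[TokenPair]) -> List[List[int]]:
    """Stand-in for a coupled NAR stage.

    Every frame's detail tokens are computed independently of the other
    frames' detail tokens (so the work is parallelizable), but each one
    depends on BOTH the semantic and the acoustic top-level token.
    """
    return [
        [(31 * s + 17 * a + layer) % ACOUSTIC_VOCAB for layer in range(NUM_DETAIL_LAYERS)]
        for s, a in top_tokens
    ]


if __name__ == "__main__":
    top = ar_generate(text_ids=[1, 2, 3], num_frames=5)
    detail = nar_refine(top)
    for frame, (pair, layers) in enumerate(zip(top, detail)):
        print(frame, pair, layers)
```

The split mirrors the trade-off the abstract points to: the sequential AR loop handles only the coarse token pair per frame, while the finer tokens are produced in a single parallel pass.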
