Parallel GPT: Harmonizing the Independence and Interdependence of Acoustic and Semantic Information for Zero-Shot Text-to-Speech
Jingyuan Xing, Zhipeng Li, Jialong Mai, Xiaofen Xing, Xiangmin Xu

TL;DR
Parallel GPT is a zero-shot TTS framework that combines autoregressive (AR) and non-autoregressive (NAR) modules to capture both the independent and the interdependent aspects of acoustic and semantic features. The authors report gains in synthesis quality and efficiency over existing zero-shot TTS models.
Contribution
The paper proposes a parallel TTS architecture that harmonizes the independence and interdependence of acoustic and semantic features. It pairs an AR model built on a Parallel Tokenizer with a Coupled NAR model.
Findings
Outperforms existing zero-shot TTS models in quality and efficiency
Demonstrates effectiveness on English and Chinese datasets
Provides high-quality speech synthesis with parallel structure
Abstract
Advances in speech representation and large language models have enhanced zero-shot text-to-speech (TTS) performance. However, existing zero-shot TTS models face challenges in capturing the complex correlations between acoustic and semantic features, resulting in a lack of expressiveness and similarity. The primary reason lies in the complex relationship between semantic and acoustic features, which manifests independent and interdependent aspects. This paper introduces a TTS framework that combines both autoregressive (AR) and non-autoregressive (NAR) modules to harmonize the independence and interdependence of acoustic and semantic information. The AR model leverages the proposed Parallel Tokenizer to synthesize the top semantic and acoustic tokens simultaneously. In contrast, considering the interdependence, the Coupled NAR model predicts detailed tokens based on the general AR…
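The two-stage flow described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the vocabulary sizes, the number of detail layers, and the functions `ar_step`, `ar_decode`, and `coupled_nar` are all hypothetical stand-ins, and simple arithmetic replaces the neural networks. It only shows the data flow: an AR loop emits one semantic and one acoustic top-level token per frame in the same step, and a NAR pass then fills in detail tokens for all frames at once, with each detail token depending on both streams.

```python
import random
from typing import List, Tuple

# All sizes below are hypothetical, chosen only for illustration.
SEM_VOCAB = 8          # size of the toy semantic token vocabulary
AC_VOCAB = 8           # size of the toy acoustic token vocabulary
N_DETAIL_LAYERS = 3    # number of toy "detail" token layers


def ar_step(text_ids: List[int], history: List[Tuple[int, int]],
            rng: random.Random) -> Tuple[int, int]:
    """Toy stand-in for one AR step: emit the top semantic and top
    acoustic token for the next frame in the same step."""
    ctx = sum(text_ids) + sum(s + a for s, a in history)
    sem = (ctx + rng.randrange(SEM_VOCAB)) % SEM_VOCAB
    ac = (ctx + rng.randrange(AC_VOCAB)) % AC_VOCAB
    return sem, ac


def ar_decode(text_ids: List[int], n_frames: int,
              seed: int = 0) -> List[Tuple[int, int]]:
    """Autoregressive loop: each frame is conditioned on all earlier frames."""
    rng = random.Random(seed)
    history: List[Tuple[int, int]] = []
    for _ in range(n_frames):
        history.append(ar_step(text_ids, history, rng))
    return history


def coupled_nar(top_tokens: List[Tuple[int, int]],
                n_layers: int = N_DETAIL_LAYERS) -> List[List[int]]:
    """Toy stand-in for the NAR stage: every frame of a layer is computed
    in one pass, and each detail token depends on BOTH top-level tokens."""
    layers = []
    for layer in range(1, n_layers + 1):
        layers.append([(3 * sem + 5 * ac + layer) % AC_VOCAB
                       for sem, ac in top_tokens])
    return layers


text_ids = [3, 1, 4]                      # hypothetical text token ids
top = ar_decode(text_ids, n_frames=5)     # one (semantic, acoustic) pair per frame
detail = coupled_nar(top)                 # N_DETAIL_LAYERS lists, one token per frame
print(top)
print(detail)
```

In this sketch the sequential cost is confined to the AR loop over frames, while the detail layers are produced without a per-frame loop dependency, which is the usual motivation for pairing an AR stage with a NAR stage.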
