High Fidelity Text-to-Speech Via Discrete Tokens Using Token Transducer and Group Masked Language Model

Joun Yeop Lee; Myeonghun Jeong; Minchan Kim; Ji-Hyun Lee; Hoon-Young Cho; Nam Soo Kim

arXiv:2406.17310 · eess.AS · June 26, 2024 · Interspeech · 1 cites

TL;DR

This paper introduces a two-stage TTS system using semantic and acoustic tokens, employing a transducer and G-MLM to improve speech quality and speaker similarity, especially in zero-shot scenarios.

Contribution

It presents a novel two-stage TTS framework with discrete tokens, combining a transducer and G-MLM for enhanced high-fidelity speech synthesis.

Findings

01 Outperforms conventional models in zero-shot speech quality

02 Achieves higher speaker similarity

03 Demonstrates robustness in aligning text to speech

Abstract

We propose a novel two-stage text-to-speech (TTS) framework with two types of discrete tokens, i.e., semantic and acoustic tokens, for high-fidelity speech synthesis. It features two core components: the Interpreting module, which processes text and a speech prompt into semantic tokens focusing on linguistic contents and alignment, and the Speaking module, which captures the timbre of the target voice to generate acoustic tokens from semantic tokens, enriching speech reconstruction. The Interpreting stage employs a transducer for its robustness in aligning text to speech. In contrast, the Speaking stage utilizes a Conformer-based architecture integrated with a Grouped Masked Language Model (G-MLM) to boost computational efficiency. Our experiments verify that this innovative structure surpasses the conventional models in the zero-shot scenario in terms of speech quality and speaker…
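The two-stage data flow described in the abstract can be sketched as follows. This is only an illustration of how the stages connect, not the authors' implementation: the names `TwoStageTTS`, `toy_interpret`, and `toy_speak` are hypothetical, the token types are assumptions, and the toy functions are placeholders standing in for the transducer and the Conformer-based G-MLM.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Assumed representations (not specified in the abstract):
SemanticTokens = List[int]        # one semantic token id per frame
AcousticTokens = List[List[int]]  # several codec indices per frame
Prompt = Sequence[float]          # speech prompt, e.g. raw samples


@dataclass
class TwoStageTTS:
    # Stage 1: text + speech prompt -> semantic tokens (a transducer in the paper).
    interpret: Callable[[str, Prompt], SemanticTokens]
    # Stage 2: semantic tokens + prompt -> acoustic tokens (G-MLM in the paper).
    # Passing the prompt here is an assumption about how timbre is conditioned.
    speak: Callable[[SemanticTokens, Prompt], AcousticTokens]

    def synthesize(self, text: str, prompt: Prompt) -> AcousticTokens:
        semantic = self.interpret(text, prompt)
        return self.speak(semantic, prompt)


def toy_interpret(text: str, prompt: Prompt) -> SemanticTokens:
    # Placeholder: one token per character, ignoring the prompt.
    return [ord(ch) % 100 for ch in text]


def toy_speak(semantic: SemanticTokens, prompt: Prompt) -> AcousticTokens:
    # Placeholder: two made-up codec indices per semantic token.
    return [[tok, (tok + 1) % 100] for tok in semantic]


tts = TwoStageTTS(interpret=toy_interpret, speak=toy_speak)
acoustic = tts.synthesize("hi", prompt=[0.0, 0.1])
print(acoustic)
```

Keeping the two stages behind separate callables mirrors the split in the abstract: alignment and linguistic content are handled before any acoustic detail is generated, so either stage could be swapped out independently.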

Taxonomy

Topics: Speech and dialogue systems · Speech Recognition and Synthesis · Natural Language Processing Techniques