High Fidelity Text-to-Speech Via Discrete Tokens Using Token Transducer and Group Masked Language Model
Joun Yeop Lee, Myeonghun Jeong, Minchan Kim, Ji-Hyun Lee, Hoon-Young Cho, Nam Soo Kim

TL;DR
This paper introduces a two-stage TTS system using semantic and acoustic tokens, employing a transducer and G-MLM to improve speech quality and speaker similarity, especially in zero-shot scenarios.
Contribution
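The paper presents a two-stage TTS framework built on discrete tokens: a token transducer turns text and a speech prompt into semantic tokens, and a Conformer-based G-MLM turns those semantic tokens into acoustic tokens for high-fidelity synthesis.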
Findings
Outperforms conventional models in zero-shot speech quality
Achieves higher speaker similarity
Demonstrates robustness in aligning text to speech
Abstract
We propose a novel two-stage text-to-speech (TTS) framework with two types of discrete tokens, i.e., semantic and acoustic tokens, for high-fidelity speech synthesis. It features two core components: the Interpreting module, which processes text and a speech prompt into semantic tokens focusing on linguistic contents and alignment, and the Speaking module, which captures the timbre of the target voice to generate acoustic tokens from semantic tokens, enriching speech reconstruction. The Interpreting stage employs a transducer for its robustness in aligning text to speech. In contrast, the Speaking stage utilizes a Conformer-based architecture integrated with a Grouped Masked Language Model (G-MLM) to boost computational efficiency. Our experiments verify that this innovative structure surpasses the conventional models in the zero-shot scenario in terms of speech quality and speaker…
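The two-stage data flow described in the abstract can be sketched roughly as follows. This is only an illustrative toy, not the authors' implementation: the function names `interpret` and `speak`, the vocabulary sizes, the number of acoustic token groups, and the token arithmetic are all hypothetical placeholders standing in for the transducer and the Conformer-based G-MLM.

```python
from typing import List

SEMANTIC_VOCAB = 512   # assumed size, not taken from the paper
ACOUSTIC_VOCAB = 1024  # assumed size, not taken from the paper


def interpret(text: str, prompt_semantic: List[int]) -> List[int]:
    """Stand-in for the Interpreting stage (text + prompt -> semantic tokens).

    A real transducer would learn the text-to-token alignment; this toy
    simply emits two placeholder tokens per input character.
    """
    offset = len(prompt_semantic)  # toy use of the prompt
    return [(ord(ch) + offset) % SEMANTIC_VOCAB for ch in text for _ in range(2)]


def speak(semantic: List[int], prompt_acoustic: List[List[int]],
          num_groups: int = 4) -> List[List[int]]:
    """Stand-in for the Speaking stage (semantic -> acoustic tokens).

    Returns one frame per semantic token, each frame holding `num_groups`
    placeholder acoustic tokens. A real model would condition on the prompt
    to capture the target speaker's timbre.
    """
    offset = len(prompt_acoustic)  # toy use of the prompt
    return [[(s + g + offset) % ACOUSTIC_VOCAB for g in range(num_groups)]
            for s in semantic]


# Stage 1 output feeds stage 2; a codec decoder (not shown) would then
# reconstruct the waveform from the acoustic tokens.
semantic_tokens = interpret("hi", prompt_semantic=[1, 2, 3])
acoustic_tokens = speak(semantic_tokens, prompt_acoustic=[[0, 0, 0, 0]])
```

The point of the sketch is only the interface: the first stage handles linguistic content and alignment, while the second stage receives semantic tokens plus a prompt and is responsible for speaker-dependent acoustic detail.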
Taxonomy
Topics: Speech and dialogue systems · Speech Recognition and Synthesis · Natural Language Processing Techniques
