Clip-TTS: Contrastive Text-content and Mel-spectrogram, A High-Quality Text-to-Speech Method based on Contextual Semantic Understanding
Tianyun Liu

TL;DR
Clip-TTS introduces a novel TTS approach that uses the CLIP architecture to improve semantic understanding during text encoding, yielding fast, high-quality synthesized speech with state-of-the-art mean opinion scores (MOS) on multiple datasets.
Contribution
The paper presents a new TTS method that integrates CLIP to connect text semantics with mel-spectrograms, improving speech quality without sacrificing inference speed.
Findings
Achieves state-of-the-art MOS on the LJSpeech and Baker datasets.
Demonstrates effective multi-emotion speech synthesis.
Maintains fast inference by using a Transformer architecture.
Abstract
Traditional text-to-speech (TTS) methods primarily focus on establishing a mapping between phonemes and mel-spectrograms. However, during the phoneme encoding stage, there is often a lack of real mel-spectrogram auxiliary information, which results in the encoding process lacking true semantic understanding. At the same time, traditional TTS systems often struggle to balance the inference speed of the model with the quality of the synthesized speech. Methods that generate high-quality synthesized speech tend to have slower inference speeds, while faster inference methods often sacrifice speech quality. In this paper, I propose Clip-TTS, a TTS method based on the Clip architecture. This method uses the Clip framework to establish a connection between text content and real mel-spectrograms during the text encoding stage, enabling the text encoder to directly learn the true semantics of…
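The abstract's idea of tying text content to real mel-spectrograms with a CLIP-style framework can be illustrated with a generic symmetric contrastive loss. This is a minimal sketch under stated assumptions, not the paper's implementation: the function names, the toy embeddings, and the temperature of 0.07 (the initial value used in the original CLIP work) are all illustrative, and the embeddings stand in for the outputs of hypothetical text and mel-spectrogram encoders.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def diagonal_cross_entropy(logits):
    # Mean cross-entropy where the correct class for row i is column i,
    # i.e. the i-th text should match the i-th mel-spectrogram.
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def clip_style_loss(text_emb, mel_emb, temperature=0.07):
    # text_emb and mel_emb are equal-length lists of vectors; pair i in one
    # list corresponds to pair i in the other (assumed batch layout).
    t = [l2_normalize(v) for v in text_emb]
    m = [l2_normalize(v) for v in mel_emb]
    logits = [
        [sum(a * b for a, b in zip(ti, mj)) / temperature for mj in m]
        for ti in t
    ]
    logits_t = [list(col) for col in zip(*logits)]  # mel-to-text direction
    # Average the text-to-mel and mel-to-text losses, as in CLIP.
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits_t))

# Toy batch: two text embeddings and their mel embeddings.
text = [[1.0, 0.0], [0.0, 1.0]]
aligned_mel = [[1.0, 0.0], [0.0, 1.0]]
swapped_mel = [[0.0, 1.0], [1.0, 0.0]]

aligned_loss = clip_style_loss(text, aligned_mel)
swapped_loss = clip_style_loss(text, swapped_mel)
```

In this toy batch the aligned pairs give a loss near zero while the swapped pairs give a large loss, which is the pressure that pulls each text embedding toward its own mel-spectrogram embedding and away from the others in the batch.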
Taxonomy
Methods: Absolute Position Encodings · Dense Connections · Linear Layer · Layer Normalization · Byte Pair Encoding · Residual Connection · Label Smoothing · Attention Is All You Need · Multi-Head Attention · Position-Wise Feed-Forward Layer
