E2 TTS: Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS
Sefik Emre Eskimez, Xiaofei Wang, Manthan Thakker, Canrun Li, Chung-Hsien Tsai, Zhen Xiao, Hemin Yang, Zirun Zhu, Min Tang, Xu Tan, Yanqing Liu, Sheng Zhao, Naoyuki Kanda

TL;DR
E2 TTS is a simple, fully non-autoregressive zero-shot TTS system that achieves human-level naturalness and state-of-the-art speaker similarity without complex components.
Contribution
It introduces a straightforward flow-matching-based mel spectrogram generator for zero-shot TTS, eliminating the need for additional models or complex alignment techniques.
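As a rough illustration of what training a flow-matching generator involves, the sketch below builds one training pair for a single mel frame. It assumes the commonly used optimal-transport conditional flow-matching path; this summary does not state the exact formulation, and the function name `flow_matching_pair`, the constant `SIGMA_MIN`, and the toy three-value frame are hypothetical. The model itself (which would be trained to predict `target` from `x_t`, `t`, and the text condition) is left out.

```python
import random

SIGMA_MIN = 1e-5  # assumed small constant; not given in this summary


def flow_matching_pair(x1, t, rng, sigma_min=SIGMA_MIN):
    """Build one (x0, x_t, target) triple for a clean mel frame x1.

    x0 is Gaussian noise, x_t is the point at time t on the straight
    path from x0 to x1, and target is the velocity along that path.
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]
    scale = 1.0 - (1.0 - sigma_min) * t
    x_t = [scale * n + t * d for n, d in zip(x0, x1)]
    target = [d - (1.0 - sigma_min) * n for n, d in zip(x0, x1)]
    return x0, x_t, target


rng = random.Random(0)
x1 = [0.5, -1.0, 2.0]  # toy stand-in for one mel spectrogram frame
x0, x_t, target = flow_matching_pair(x1, t=0.25, rng=rng)
```

At `t = 0` the interpolated point equals the noise sample, and following `target` for the remaining time `1 - t` lands on the clean frame up to a residual of `sigma_min * x0`.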
Findings
Achieves state-of-the-art zero-shot TTS performance.
Comparable to or better than Voicebox and NaturalSpeech 3.
Flexible input representation and inference variants.
Abstract
This paper introduces Embarrassingly Easy Text-to-Speech (E2 TTS), a fully non-autoregressive zero-shot text-to-speech system that offers human-level naturalness and state-of-the-art speaker similarity and intelligibility. In the E2 TTS framework, the text input is converted into a character sequence with filler tokens. The flow-matching-based mel spectrogram generator is then trained based on the audio infilling task. Unlike many previous works, it does not require additional components (e.g., duration model, grapheme-to-phoneme) or complex techniques (e.g., monotonic alignment search). Despite its simplicity, E2 TTS achieves state-of-the-art zero-shot TTS capabilities that are comparable to or surpass previous works, including Voicebox and NaturalSpeech 3. The simplicity of E2 TTS also allows for flexibility in the input representation. We propose several variants of E2 TTS to improve…
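The "character sequence with filler tokens" step can be pictured with a small sketch. It assumes the filler tokens are appended until the character sequence is as long as the mel spectrogram (in frames), which is what lets the model skip an explicit duration model or aligner; the abstract above does not spell this out. The filler symbol `<F>` and the function name `pad_with_filler` are hypothetical.

```python
FILLER = "<F>"  # hypothetical filler symbol


def pad_with_filler(text, num_frames, filler=FILLER):
    """Split text into characters and append filler tokens up to num_frames."""
    chars = list(text)
    if len(chars) > num_frames:
        raise ValueError("text is longer than the number of mel frames")
    return chars + [filler] * (num_frames - len(chars))


# Toy example: 8 characters padded to a 12-frame target length.
tokens = pad_with_filler("hi there", 12)
```

No grapheme-to-phoneme conversion appears anywhere in this step: the raw characters are the input, and the padded sequence simply shares the time axis of the audio.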
Taxonomy
Topics: Brain Tumor Detection and Classification · Industrial Vision Systems and Defect Detection · Ultrasonics and Acoustic Wave Propagation
