TL;DR
This paper introduces vTTS, a speech synthesis method that generates speech directly from images of text (visual text), carrying emphasis and emotion cues from the image into the speech without extra labels.
Contribution
vTTS is the first approach to synthesize speech from visual text images, preserving visual features and enabling emotion transfer without additional annotations.
Findings
vTTS produces speech with naturalness comparable to or better than that of conventional TTS.
It transfers emphasis and emotion attributes from visual text to speech without additional labels or architectures.
It synthesizes more natural and intelligible speech from unseen and rare characters.
Abstract
This paper proposes visual-text to speech (vTTS), a method for synthesizing speech from visual text (i.e., text as an image). Conventional TTS converts phonemes or characters into discrete symbols and synthesizes a speech waveform from them, thus losing the visual features that the characters essentially have. Therefore, our method synthesizes speech not from discrete symbols but from visual text. The proposed vTTS extracts visual features with a convolutional neural network and then generates acoustic features with a non-autoregressive model inspired by FastSpeech2. Experimental results show that 1) vTTS is capable of generating speech with naturalness comparable to or better than a conventional TTS, 2) it can transfer emphasis and emotion attributes in visual text to speech without additional labels and architectures, and 3) it can synthesize more natural and intelligible speech from…
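The data flow described in the abstract (visual text, then convolutional features, then a non-autoregressive acoustic model) can be illustrated with a toy sketch. This is not the authors' implementation: the fixed-width window slicing, the hand-written kernels, the function names, and the hard-coded durations are all assumptions made for illustration. The only pieces taken from known designs are plain 2D convolution and the FastSpeech-style length regulator, which repeats each input feature for a number of output frames.

```python
from typing import List

Image = List[List[float]]  # grayscale image as rows of pixel values


def slice_windows(image: Image, width: int, stride: int) -> List[Image]:
    """Cut a text-line image into fixed-width windows, left to right.

    Assumption: one window roughly covers one character.
    """
    n_cols = len(image[0])
    return [[row[x:x + width] for row in image]
            for x in range(0, n_cols - width + 1, stride)]


def conv2d_valid(patch: Image, kernel: Image) -> Image:
    """Plain 2D cross-correlation with no padding ('valid' mode)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(patch) - kh + 1
    out_w = len(patch[0]) - kw + 1
    return [[sum(patch[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]


def visual_feature(patch: Image, kernels: List[Image]) -> List[float]:
    """One feature per kernel: convolution, ReLU, global average pooling."""
    feats = []
    for kernel in kernels:
        fmap = conv2d_valid(patch, kernel)
        vals = [max(0.0, v) for row in fmap for v in row]
        feats.append(sum(vals) / len(vals))
    return feats


def length_regulate(features: List[List[float]],
                    durations: List[int]) -> List[List[float]]:
    """Repeat each per-window feature by its duration in frames."""
    frames = []
    for feat, dur in zip(features, durations):
        frames.extend([feat] * dur)
    return frames


# Toy 4x8 "visual text" image holding two 4-pixel-wide glyphs.
image = [
    [0, 1, 1, 0, 1, 0, 0, 1],
    [0, 1, 1, 0, 1, 0, 0, 1],
    [0, 1, 1, 0, 1, 1, 1, 1],
    [0, 1, 1, 0, 1, 0, 0, 1],
]
# Hand-written kernels stand in for learned CNN weights (assumption).
kernels = [
    [[1.0, -1.0], [1.0, -1.0]],   # responds to vertical edges
    [[1.0, 1.0], [-1.0, -1.0]],   # responds to horizontal edges
]

windows = slice_windows(image, width=4, stride=4)
features = [visual_feature(w, kernels) for w in windows]
# In a FastSpeech2-style model the durations are predicted; here they are fixed.
frames = length_regulate(features, durations=[2, 3])
print(len(windows), len(frames), len(frames[0]))
```

Because the features are computed from pixels rather than looked up by symbol ID, a glyph drawn differently (bold, underlined, colored) yields a different feature vector, which is the property the abstract relies on for transferring emphasis and emotion.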
