Show and Speak: Directly Synthesize Spoken Description of Images
Xinsheng Wang, Siyuan Feng, Jihua Zhu, Mark Hasegawa-Johnson, Odette Scharenborg

TL;DR
This paper introduces the Show and Speak (SAS) model that directly synthesizes spoken image descriptions from images without using text or phonemes, demonstrating feasibility through experiments on Flickr8k.
Contribution
The SAS model is the first to directly generate speech descriptions of images from visual input without intermediate text or phoneme representations.
Findings
Successfully synthesizes natural spoken descriptions for images
Demonstrates feasibility of bypassing text and phonemes in speech synthesis
Achieves promising results on the Flickr8k benchmark
Abstract
This paper proposes a new model, referred to as the show and speak (SAS) model that, for the first time, is able to directly synthesize spoken descriptions of images, bypassing the need for any text or phonemes. The basic structure of SAS is an encoder-decoder architecture that takes an image as input and predicts the spectrogram of speech that describes this image. The final speech audio is obtained from the predicted spectrogram via WaveNet. Extensive experiments on the public benchmark database Flickr8k demonstrate that the proposed SAS is able to synthesize natural spoken descriptions for images, indicating that synthesizing spoken descriptions for images while bypassing text and phonemes is feasible.
Taxonomy
Topics: Video Analysis and Summarization · Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques
Methods: Dilated Causal Convolution · Mixture of Logistic Distributions · WaveNet
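For readers unfamiliar with the dilated causal convolution listed among the methods, the following is a minimal pure-Python sketch of the operation that WaveNet-style vocoders stack to turn a predicted spectrogram into a waveform. It is illustrative only and is not taken from the SAS implementation: the function names `dilated_causal_conv1d` and `receptive_field`, the zero left-padding, and the toy inputs are all assumptions made for this example.

```python
def dilated_causal_conv1d(signal, weights, dilation=1):
    """1-D causal convolution with dilation (hypothetical helper).

    output[t] = sum over k of weights[k] * signal[t - k * dilation].
    Samples before t = 0 are treated as zeros, so output[t] never
    depends on any input later than t (the "causal" property).
    """
    output = []
    for t in range(len(signal)):
        total = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation  # look back k * dilation steps
            if idx >= 0:            # implicit zero padding on the left
                total += w * signal[idx]
        output.append(total)
    return output


def receptive_field(kernel_size, dilations):
    """Number of input samples seen by a stack of dilated causal layers."""
    return 1 + (kernel_size - 1) * sum(dilations)


if __name__ == "__main__":
    x = [1, 2, 3, 4, 5]
    # Kernel of size 2 with dilation 2: each output adds the current
    # sample and the sample two steps earlier.
    print(dilated_causal_conv1d(x, [1, 1], dilation=2))
    # Doubling the dilation per layer grows the receptive field quickly.
    print(receptive_field(2, [1, 2, 4, 8]))
```

Doubling the dilation at each layer is what lets a stack of small kernels cover a long audio context with few layers, which is why the operation is associated with sample-level waveform generation.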
