Show and Speak: Directly Synthesize Spoken Description of Images

Xinsheng Wang; Siyuan Feng; Jihua Zhu; Mark Hasegawa-Johnson; Odette Scharenborg

arXiv:2010.12267 · cs.CV · November 18, 2020

TL;DR

This paper introduces the Show and Speak (SAS) model that directly synthesizes spoken image descriptions from images without using text or phonemes, demonstrating feasibility through experiments on Flickr8k.

Contribution

The SAS model is the first to directly generate speech descriptions of images from visual input without intermediate text or phoneme representations.

Findings

01 Successfully synthesizes natural spoken descriptions for images

02 Demonstrates feasibility of bypassing text and phonemes in speech synthesis

03 Achieves promising results on Flickr8k benchmark

Abstract

This paper proposes a new model, referred to as the show and speak (SAS) model, which for the first time is able to directly synthesize spoken descriptions of images, bypassing the need for any text or phonemes. The basic structure of SAS is an encoder-decoder architecture that takes an image as input and predicts the spectrogram of speech that describes this image. The final speech audio is obtained from the predicted spectrogram via WaveNet. Extensive experiments on the public benchmark database Flickr8k demonstrate that the proposed SAS is able to synthesize natural spoken descriptions for images, indicating that synthesizing spoken descriptions for images while bypassing text and phonemes is feasible.
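The data flow described in the abstract (image, then encoder, then spectrogram decoder, then vocoder) can be sketched at the level of shapes. This is only a toy illustration under stated assumptions, not the authors' implementation: the function names `encode_image`, `decode_spectrogram`, `vocode`, and `show_and_speak`, the pooling encoder, the linear recurrence in the decoder, and all sizes are hypothetical stand-ins. The real model uses learned neural networks and WaveNet as the vocoder.

```python
from typing import List, Tuple

Vector = List[float]


def encode_image(image: List[List[float]]) -> Vector:
    """Hypothetical encoder: column-wise mean of a pixel grid.

    Stands in for a learned image encoder that outputs a context vector.
    """
    rows = len(image)
    cols = len(image[0])
    return [sum(row[c] for row in image) / rows for c in range(cols)]


def decode_spectrogram(context: Vector, n_frames: int, n_mels: int) -> List[Vector]:
    """Hypothetical autoregressive decoder.

    Each frame depends on the image context and on the previous frame,
    mimicking how a sequence decoder emits a spectrogram frame by frame.
    No text or phoneme representation appears anywhere in the loop.
    """
    bias = sum(context) / len(context)
    prev = [0.0] * n_mels
    frames: List[Vector] = []
    for _ in range(n_frames):
        frame = [0.5 * p + bias for p in prev]
        frames.append(frame)
        prev = frame
    return frames


def vocode(spectrogram: List[Vector], hop: int) -> List[float]:
    """Placeholder vocoder: emits `hop` waveform samples per frame.

    The paper uses WaveNet here; this stub only reproduces the
    frame-to-sample upsampling, not the audio quality.
    """
    return [sum(frame) / len(frame) for frame in spectrogram for _ in range(hop)]


def show_and_speak(
    image: List[List[float]], n_frames: int = 5, n_mels: int = 3, hop: int = 4
) -> Tuple[List[Vector], List[float]]:
    """Image in, (spectrogram, waveform) out, with no intermediate text."""
    context = encode_image(image)
    spectrogram = decode_spectrogram(context, n_frames, n_mels)
    return spectrogram, vocode(spectrogram, hop)


if __name__ == "__main__":
    spec, wave = show_and_speak([[0.0, 1.0], [1.0, 1.0]])
    print(len(spec), len(spec[0]), len(wave))
```

The point of the sketch is the interface: the only input is the image, and the only intermediate representation between image and audio is the spectrogram.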

Code & Models

Repositories

xinshengwang/Show-and-Speak (PyTorch, official)

Taxonomy

Topics: Video Analysis and Summarization · Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques

Methods: Dilated Causal Convolution · Mixture of Logistic Distributions · WaveNet
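Dilated causal convolution, listed among the methods, is the building block of WaveNet: each output sample depends only on the current and earlier input samples, and the dilation spaces the filter taps apart so that stacked layers cover a long history. The following is a minimal pure-Python sketch of a single 1-D layer, assuming zero padding on the left; the function names `dilated_causal_conv` and `receptive_field` are illustrative and do not come from the paper or its repository.

```python
from typing import List, Sequence


def dilated_causal_conv(
    x: Sequence[float], weights: Sequence[float], dilation: int
) -> List[float]:
    """Compute y[t] = sum_k weights[k] * x[t - k * dilation].

    Inputs before t = 0 are treated as zero, so y[t] never depends on
    any x[s] with s > t (causality).
    """
    y: List[float] = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation
            if idx >= 0:
                acc += w * x[idx]
        y.append(acc)
    return y


def receptive_field(kernel_size: int, dilations: Sequence[int]) -> int:
    """Number of input samples seen by one output of a stack of layers."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


if __name__ == "__main__":
    print(dilated_causal_conv([1, 2, 3, 4, 5], [1, 1], dilation=2))
    print(receptive_field(2, [1, 2, 4, 8]))
```

With kernel size 2 and dilations doubling per layer, the receptive field grows exponentially with depth while the parameter count grows only linearly, which is why this layout suits raw-audio generation.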