S2IGAN: Speech-to-Image Generation via Adversarial Learning

Xinsheng Wang; Tingting Qiao; Jihua Zhu; Alan Hanjalic; Odette Scharenborg

arXiv:2005.06968 · cs.LG · September 16, 2020 · 1 citation

TL;DR

This paper introduces S2IGAN, a framework that translates spoken descriptions directly into photo-realistic images without using any text, so that unwritten languages can potentially benefit from image generation technology.

Contribution

The paper proposes S2IGAN, a speech-to-image generation framework that combines a speech embedding network (SEN) with a relation-supervised densely-stacked generative model (RDG) and does not rely on text data.

Findings

01. Effective synthesis of high-quality images from speech signals.

02. Semantic consistency between speech descriptions and generated images.

03. Strong baseline performance on benchmark datasets.

Abstract

An estimated half of the world's languages do not have a written form, making it impossible for these languages to benefit from any existing text-based technologies. In this paper, a speech-to-image generation (S2IG) framework is proposed which translates speech descriptions to photo-realistic images without using any text information, thus allowing unwritten languages to potentially benefit from this technology. The proposed S2IG framework, named S2IGAN, consists of a speech embedding network (SEN) and a relation-supervised densely-stacked generative model (RDG). SEN learns the speech embedding with the supervision of the corresponding visual information. Conditioned on the speech embedding produced by SEN, the proposed RDG synthesizes images that are semantically consistent with the corresponding speech descriptions. Extensive experiments on two public benchmark datasets CUB and…
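
The two-stage data flow described in the abstract (speech to embedding, then embedding to image) can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the dimensions, the random linear maps standing in for trained networks, and the function names `speech_embedding` and `generate_image` are hypothetical and are not taken from the paper, which does not specify these details in the abstract. The sketch shows only how a generator can be conditioned on a speech embedding plus noise; it does not reproduce the training, the adversarial loss, or the relation supervision of S2IGAN.

```python
import math
import random

# All sizes are illustrative assumptions, not values from the paper.
SPEECH_FEAT_DIM = 40   # pooled speech feature size (assumed)
EMBED_DIM = 8          # speech embedding size (assumed)
NOISE_DIM = 4          # generator noise size (assumed)
IMAGE_SIDE = 4         # toy "image" is IMAGE_SIDE x IMAGE_SIDE, one channel


def make_matrix(rows, cols, rng):
    """Random weight matrix standing in for a trained network."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Plain matrix-vector product on Python lists."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def speech_embedding(weights, speech_features):
    """Hypothetical stand-in for SEN: speech features -> unit-length embedding."""
    raw = matvec(weights, speech_features)
    norm = math.sqrt(sum(v * v for v in raw)) or 1.0
    return [v / norm for v in raw]


def generate_image(weights, embedding, noise):
    """Hypothetical stand-in for RDG: [embedding, noise] -> pixels in (-1, 1)."""
    flat = [math.tanh(v) for v in matvec(weights, embedding + noise)]
    return [flat[i * IMAGE_SIDE:(i + 1) * IMAGE_SIDE] for i in range(IMAGE_SIDE)]


rng = random.Random(0)
sen_weights = make_matrix(EMBED_DIM, SPEECH_FEAT_DIM, rng)
rdg_weights = make_matrix(IMAGE_SIDE * IMAGE_SIDE, EMBED_DIM + NOISE_DIM, rng)

speech = [rng.gauss(0.0, 1.0) for _ in range(SPEECH_FEAT_DIM)]
noise = [rng.gauss(0.0, 1.0) for _ in range(NOISE_DIM)]

embedding = speech_embedding(sen_weights, speech)       # stage 1: speech -> embedding
image = generate_image(rdg_weights, embedding, noise)   # stage 2: embedding -> image
```

In this toy version, changing `noise` while holding `embedding` fixed yields different images for the same spoken description, which is the usual role of the noise input in a conditional generator.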

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Music and Audio Processing