S2IGAN: Speech-to-Image Generation via Adversarial Learning
Xinsheng Wang, Tingting Qiao, Jihua Zhu, Alan Hanjalic, Odette Scharenborg

TL;DR
This paper introduces S2IGAN, a novel framework that converts speech descriptions directly into photo-realistic images without text, enabling applications for unwritten languages and advancing speech-to-image generation technology.
Contribution
The paper proposes S2IGAN, a new speech-to-image generation framework that uses speech embeddings and a relation-supervised generative model, without relying on text data.
Findings
Effective synthesis of high-quality images from speech signals.
Semantic consistency between speech descriptions and generated images.
Strong baseline performance on benchmark datasets.
Abstract
An estimated half of the world's languages do not have a written form, making it impossible for these languages to benefit from any existing text-based technologies. In this paper, a speech-to-image generation (S2IG) framework is proposed which translates speech descriptions to photo-realistic images without using any text information, thus allowing unwritten languages to potentially benefit from this technology. The proposed S2IG framework, named S2IGAN, consists of a speech embedding network (SEN) and a relation-supervised densely-stacked generative model (RDG). SEN learns the speech embedding with the supervision of the corresponding visual information. Conditioned on the speech embedding produced by SEN, the proposed RDG synthesizes images that are semantically consistent with the corresponding speech descriptions. Extensive experiments on two public benchmark datasets CUB and…
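The two-stage data flow described in the abstract (speech embedding first, then image synthesis conditioned on that embedding) can be sketched roughly as follows. This is a toy illustration only, not the authors' implementation: the class names, layer sizes, mean-pooling step, noise input, and random untrained weights are all assumptions, and the sketch omits the visual supervision of SEN, the relation supervision and dense stacking of RDG, and adversarial training entirely.

```python
import math
import random

# All sizes below are hypothetical, chosen only to keep the sketch tiny.
EMB_DIM = 8     # size of the speech embedding
NOISE_DIM = 4   # size of the random noise vector
IMG_SIDE = 4    # side length of the toy "image"


def make_matrix(rows, cols, rng):
    """Random weights standing in for trained parameters."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vec):
    """Plain matrix-vector product."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


class SpeechEmbeddingNet:
    """Stand-in for SEN: mean-pools frame features, then projects them."""

    def __init__(self, feat_dim, rng):
        self.proj = make_matrix(EMB_DIM, feat_dim, rng)

    def embed(self, frames):
        pooled = [sum(col) / len(frames) for col in zip(*frames)]
        return [math.tanh(x) for x in matvec(self.proj, pooled)]


class Generator:
    """Stand-in for RDG: maps [speech embedding; noise] to a toy image."""

    def __init__(self, rng):
        self.w = make_matrix(IMG_SIDE * IMG_SIDE, EMB_DIM + NOISE_DIM, rng)

    def generate(self, emb, noise):
        flat = [math.tanh(x) for x in matvec(self.w, emb + noise)]
        return [flat[r * IMG_SIDE:(r + 1) * IMG_SIDE] for r in range(IMG_SIDE)]


def speech_to_image(frames, sen, gen, rng):
    """Speech features -> embedding -> image; no text is involved."""
    emb = sen.embed(frames)
    noise = [rng.gauss(0.0, 1.0) for _ in range(NOISE_DIM)]
    return gen.generate(emb, noise)


rng = random.Random(0)
# 20 frames of 5 made-up acoustic features each.
frames = [[rng.random() for _ in range(5)] for _ in range(20)]
sen = SpeechEmbeddingNet(feat_dim=5, rng=rng)
gen = Generator(rng)
image = speech_to_image(frames, sen, gen, rng)
```

The point of the sketch is the interface: the generator only ever sees the speech embedding and noise, so no transcription step appears anywhere in the pipeline.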
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Music and Audio Processing
