Speak the Art: A Direct Speech to Image Generation Framework

Mariam Saeed; Manar Amr; Farida Adel; Nada Hassan; Nour Walid; Eman Mohamed; Mohamed Hussein; Marwan Torki

arXiv:2601.00827·eess.AS·January 13, 2026

TL;DR

This paper introduces Speak the Art (STA), a speech-to-image generation framework that pairs a speech encoder supervised by a large image-text model with a VQ-Diffusion network, and reports better results and more stable training than prior GAN-based methods.

Contribution

The paper proposes a novel speech-to-image framework combining supervised speech encoding and diffusion models, improving stability, diversity, and multilingual capabilities.

Findings

01. Outperforms state-of-the-art models significantly

02. Uses diffusion instead of GANs for stable training

03. Demonstrates multilingual speech-to-image generation

Abstract

Direct speech-to-image generation has recently shown promising results. However, compared to text-to-image generation, there is still a large gap to close. Current approaches tackle this task in two stages: a speech encoding network and an image generative adversarial network (GAN). The speech encoding networks in these approaches produce embeddings that do not capture enough linguistic information to represent the input speech semantically. GANs suffer from non-convergence, mode collapse, and diminished gradients, which result in unstable model parameters, limited sample diversity, and ineffective generator learning, respectively. To address these weaknesses, we introduce a framework called Speak the Art (STA), which consists of a speech encoding network and a VQ-Diffusion network conditioned on speech embeddings. To improve speech embeddings, the speech encoding…
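The TL;DR describes the speech encoder as supervised by a large image-text model. One common way to realize that kind of supervision is to pull each speech embedding toward the embedding a frozen teacher model produces for the matching caption. The sketch below illustrates that idea with a cosine-distance loss on plain Python lists. The loss form and the names `cosine_similarity`, `alignment_loss`, and `batch_alignment_loss` are assumptions made for illustration; the excerpt above does not state the paper's actual training objective.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def alignment_loss(speech_embedding, teacher_embedding):
    """Hypothetical distillation loss: 0 when the speech embedding points in
    the same direction as the teacher embedding, 2 when it points the
    opposite way. This is an assumed objective, not the paper's."""
    return 1.0 - cosine_similarity(speech_embedding, teacher_embedding)


def batch_alignment_loss(speech_batch, teacher_batch):
    """Mean alignment loss over paired speech and teacher embeddings."""
    if len(speech_batch) != len(teacher_batch) or not speech_batch:
        raise ValueError("batches must be non-empty and of equal size")
    losses = [alignment_loss(s, t) for s, t in zip(speech_batch, teacher_batch)]
    return sum(losses) / len(losses)
```

In a real training loop the speech embeddings would come from the trainable speech encoder and the teacher embeddings from the frozen image-text model, so only the speech encoder receives gradients. The aligned speech embedding would then serve as the conditioning input to the image generator.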

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Historical Architecture and Urbanism