Seeing What You Say: Expressive Image Generation from Speech

Jiyoung Lee; Song Park; Sanghyuk Chun; Soo-Whan Chung

arXiv:2511.03423·eess.AS·November 6, 2025

Open Access

TL;DR

VoxStudio is an end-to-end speech-to-image model that captures expressive and emotional cues directly from speech, without an intermediate text transcription step. It is accompanied by VoxEmoset, a new large-scale paired emotional speech-image dataset.

Contribution

The paper introduces VoxStudio, which the authors describe as the first unified model that generates expressive images directly from speech, built around a speech information bottleneck (SIB) module.

Findings

01 Demonstrates that expressive images can be generated directly from speech on the SpokenCOCO, Flickr8kAudio, and VoxEmoset benchmarks.

02 Preserves prosody and emotional cues that text-based pipelines tend to discard.

03 Highlights open challenges, including emotional consistency and linguistic ambiguity.

Abstract

This paper proposes VoxStudio, the first unified and end-to-end speech-to-image model that generates expressive images directly from spoken descriptions by jointly aligning linguistic and paralinguistic information. At its core is a speech information bottleneck (SIB) module, which compresses raw speech into compact semantic tokens, preserving prosody and emotional nuance. By operating directly on these tokens, VoxStudio eliminates the need for an additional speech-to-text system, which often ignores the hidden details beyond text, e.g., tone or emotion. We also release VoxEmoset, a large-scale paired emotional speech-image dataset built via an advanced TTS engine to affordably generate richly expressive utterances. Comprehensive experiments on the SpokenCOCO, Flickr8kAudio, and VoxEmoset benchmarks demonstrate the feasibility of our method and highlight key challenges, including…
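The abstract does not say how the SIB module is implemented. Purely to illustrate the bottleneck idea it describes (many speech frames in, a small number of tokens out), the following is a toy sketch that mean-pools frame features into a fixed number of vectors. The function name `compress_frames`, the pooling strategy, and the example data are all assumptions made for illustration and are not the paper's method.

```python
from typing import List


def compress_frames(frames: List[List[float]], num_tokens: int) -> List[List[float]]:
    """Toy bottleneck (hypothetical): mean-pool T frame vectors into num_tokens vectors."""
    if num_tokens <= 0:
        raise ValueError("num_tokens must be positive")
    if len(frames) < num_tokens:
        raise ValueError("need at least num_tokens frames")

    total = len(frames)
    dim = len(frames[0])
    tokens = []
    for k in range(num_tokens):
        # Split the frame sequence into num_tokens contiguous, non-empty segments.
        start = k * total // num_tokens
        end = (k + 1) * total // num_tokens
        segment = frames[start:end]
        # Average each feature dimension over the segment.
        tokens.append([sum(f[d] for f in segment) / len(segment) for d in range(dim)])
    return tokens


# Eight made-up 2-dimensional "frames" compressed into two tokens.
frames = [[float(t), float(2 * t)] for t in range(8)]
tokens = compress_frames(frames, num_tokens=2)
print(tokens)
```

A learned bottleneck would replace the fixed averaging with trained parameters; the sketch only shows the change in sequence length that such a module produces.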

Taxonomy

Topics: Multimodal Machine Learning Applications · Emotion and Mood Recognition · Generative Adversarial Networks and Image Synthesis