Images that Sound: Composing Images and Sounds on a Single Canvas

Ziyang Chen; Daniel Geng; Andrew Owens

arXiv:2405.12221 · cs.CV · February 6, 2025

Open Access

TL;DR

This paper introduces a zero-shot method for generating spectrograms that look like natural images and sound like natural audio. It denoises a single latent with pre-trained text-to-image and text-to-spectrogram diffusion models that operate in a shared latent space.

Contribution

It presents a zero-shot approach that uses pre-trained text-to-image and text-to-spectrogram diffusion models to synthesize spectrograms that match a visual prompt and an audio prompt at the same time.

Findings

01

Generated spectrograms align with audio prompts

02

Spectrograms visually resemble target images

03

Method works without additional training or fine-tuning

Abstract

Spectrograms are 2D representations of sound that look very different from the images found in our visual world. And natural images, when played as spectrograms, make unnatural sounds. In this paper, we show that it is possible to synthesize spectrograms that simultaneously look like natural images and sound like natural audio. We call these visual spectrograms images that sound. Our approach is simple and zero-shot, and it leverages pre-trained text-to-image and text-to-spectrogram diffusion models that operate in a shared latent space. During the reverse process, we denoise noisy latents with both the audio and image diffusion models in parallel, resulting in a sample that is likely under both models. Through quantitative evaluations and perceptual studies, we find that our method successfully generates spectrograms that align with a desired audio prompt while also taking the visual…
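
The parallel denoising step described in the abstract can be illustrated with a minimal sketch. The abstract does not say how the two models' outputs are combined, so the weighted average of noise estimates below is an assumption, as are the deterministic DDIM-style update, the noise schedule, and every name in the code (`make_toy_denoiser`, `joint_ddim_sample`, `w_image`, `w_audio`). The toy denoisers are hypothetical stand-ins for the pre-trained networks: each one pulls the latent toward a single fixed target.

```python
import numpy as np


def make_toy_denoiser(target):
    """Hypothetical stand-in for a pre-trained noise predictor.

    If the data distribution is a single point `target`, then
    z_t = sqrt(a_t) * target + sqrt(1 - a_t) * eps, so the noise
    can be recovered in closed form.
    """
    def predict_noise(z_t, alpha_bar_t):
        return (z_t - np.sqrt(alpha_bar_t) * target) / np.sqrt(1.0 - alpha_bar_t)
    return predict_noise


def joint_ddim_sample(eps_image, eps_audio, shape, num_steps=50,
                      w_image=0.5, w_audio=0.5, seed=0):
    """Denoise one shared latent with two noise predictors in parallel.

    Assumption: the two estimates are merged by a weighted average
    before a deterministic DDIM-style update.
    """
    rng = np.random.default_rng(seed)
    # Cumulative signal levels; index 0 is the least noisy step.
    alpha_bars = np.linspace(0.999, 0.001, num_steps)
    z = rng.standard_normal(shape)  # start from pure noise

    for i in range(num_steps - 1, -1, -1):
        a_t = alpha_bars[i]
        a_prev = alpha_bars[i - 1] if i > 0 else 1.0

        # Query both models on the same noisy latent, then combine.
        eps = w_image * eps_image(z, a_t) + w_audio * eps_audio(z, a_t)

        # Predict the clean latent, then step to the previous noise level.
        z0_hat = (z - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)
        z = np.sqrt(a_prev) * z0_hat + np.sqrt(1.0 - a_prev) * eps

    return z


# Toy usage: two "models" that prefer different constant latents.
image_target = np.full((4, 8), 1.0)
audio_target = np.full((4, 8), 3.0)
sample = joint_ddim_sample(make_toy_denoiser(image_target),
                           make_toy_denoiser(audio_target),
                           image_target.shape)
```

With equal weights, the toy sample settles between the two targets, which mirrors the idea of a sample that is likely under both models. Real pre-trained models would replace the two closures, and the result would still need to be decoded from the shared latent space into a spectrogram.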

Taxonomy

Topics: 3D Surveying and Cultural Heritage

Methods: ALIGN · Diffusion