Generating Realistic Images from In-the-wild Sounds

Taegyeong Lee; Jeonghun Kang; Hyeonyu Kim; Taehwan Kim

arXiv:2309.02405·cs.CV·September 6, 2023

TL;DR

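This paper introduces a method for generating realistic images from in-the-wild sounds. It converts sound to text with audio captioning, applies audio attention and sentence attention, optimizes with CLIPscore and AudioCLIP, and generates images with a diffusion-based model. The authors report that it outperforms baselines in both quantitative and qualitative evaluations.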

Contribution

The study presents a new pipeline that combines audio captioning, attention mechanisms, and a diffusion-based model to generate images from in-the-wild sounds.

Findings

01. High-quality image generation from wild sounds

02. Outperforms baseline methods in quantitative and qualitative evaluations

03. Effective use of audio captioning and attention mechanisms

Abstract

Representing wild sounds as images is an important but challenging task due to the lack of paired datasets between sound and images and the significant differences in the characteristics of these two modalities. Previous studies have focused on generating images from sound in limited categories or music. In this paper, we propose a novel approach to generate images from in-the-wild sounds. First, we convert sound into text using audio captioning. Second, we propose audio attention and sentence attention to represent the rich characteristics of sound and visualize the sound. Lastly, we propose a direct sound optimization with CLIPscore and AudioCLIP and generate images with a diffusion-based model. Experiments show that our model generates high-quality images from wild sounds and outperforms baselines in both quantitative and qualitative evaluations on wild audio…
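The abstract names CLIPscore as one of the optimization signals but does not give its formula. As a rough illustration, the sketch below uses the commonly cited definition of CLIPScore: a rescaled, non-negative cosine similarity between an image embedding and a text embedding, with scaling weight 2.5. The function names and the toy vectors are hypothetical. Real embeddings would come from a CLIP encoder, and how the paper combines this score with AudioCLIP during optimization is not stated in the abstract.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def clip_score(image_emb, text_emb, w=2.5):
    """CLIPScore-style metric: w * max(cos(image, text), 0).

    Assumption: w = 2.5 follows the commonly cited CLIPScore definition;
    the abstract does not state which weight the authors use.
    """
    return w * max(cosine_similarity(image_emb, text_emb), 0.0)


# Toy embeddings standing in for CLIP image/text features (hypothetical values).
image_embedding = [0.2, 0.7, 0.1]
caption_embedding = [0.25, 0.6, 0.05]
score = clip_score(image_embedding, caption_embedding)
```

A higher score means the generated image and the caption derived from the sound are better aligned in the shared embedding space, which is what makes such a score usable as an optimization target.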

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Video Analysis and Summarization