Sound2Vision: Generating Diverse Visuals from Audio through Cross-Modal Latent Alignment

Kim Sung-Bin; Arda Senocak; Hyunwoo Ha; Tae-Hyun Oh

arXiv:2412.06209 · cs.CV · December 10, 2024

TL;DR

This paper introduces a cross-modal method that generates diverse images of visual scenes from in-the-wild sounds. It aligns audio features with the visual latent space of a pre-trained image generator, which improves image quality and gives control over the generated output.

Contribution

The paper presents a model that aligns the audio and visual modalities by enriching audio features with visual information, and it reports better results than previous work along with generalization across datasets and architectures.

Findings

01 Outperforms previous methods on the VEGAS and VGGSound datasets

02 Enables control over generated visuals via input manipulations

03 Shows effective alignment of audio-visual signals in the embedding space
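The second finding refers to controlling the output by manipulating the input. The listing does not spell out which manipulations are used, so the following is only a hedged sketch of three plausible ones: scaling the volume of a waveform, mixing two waveforms, and interpolating between two latent vectors. The function names `scale_volume`, `mix_waveforms`, and `interpolate_latents` are hypothetical and do not come from the paper or its repository; waveforms and latents are represented as plain lists of floats for simplicity.

```python
def scale_volume(waveform, gain):
    """Multiply every audio sample by a constant gain (hypothetical helper)."""
    return [gain * sample for sample in waveform]


def mix_waveforms(wave_a, wave_b, weight=0.5):
    """Weighted sum of two waveforms, truncated to the shorter length."""
    n = min(len(wave_a), len(wave_b))
    return [weight * wave_a[i] + (1.0 - weight) * wave_b[i] for i in range(n)]


def interpolate_latents(z_a, z_b, t):
    """Linear interpolation between two latent vectors, with t in [0, 1]."""
    return [(1.0 - t) * a + t * b for a, b in zip(z_a, z_b)]
```

In a full pipeline, the manipulated waveform would be passed through the audio encoder, or the interpolated latent passed directly to the image generator; both of those components are outside this sketch.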

Abstract

How does audio describe the world around us? In this work, we propose a method for generating images of visual scenes from diverse in-the-wild sounds. This cross-modal generation task is challenging due to the significant information gap between auditory and visual signals. We address this challenge by designing a model that aligns audio-visual modalities by enriching audio features with visual information and translating them into the visual latent space. These features are then fed into the pre-trained image generator to produce images. To enhance image quality, we use sound source localization to select audio-visual pairs with strong cross-modal correlations. Our method achieves substantially better results on the VEGAS and VGGSound datasets compared to previous work and demonstrates control over the generation process through simple manipulations to the input waveform or latent…
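The abstract describes translating audio features into the visual latent space but does not name the training objective. One common way to align paired embeddings is an InfoNCE-style contrastive loss, sketched below as an assumption rather than as the authors' method. The function names, the temperature value, and the toy two-dimensional embeddings are all illustrative.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce(audio_feats, visual_feats, temperature=0.1):
    """Mean cross-entropy of matching each audio embedding to its paired
    visual embedding, using the other visuals in the batch as negatives."""
    audio = [l2_normalize(v) for v in audio_feats]
    visual = [l2_normalize(v) for v in visual_feats]
    total = 0.0
    for i, a_i in enumerate(audio):
        # Cosine similarity to every visual embedding, scaled by temperature.
        logits = [sum(x * y for x, y in zip(a_i, z_j)) / temperature
                  for z_j in visual]
        # Numerically stable log-sum-exp over the batch.
        peak = max(logits)
        log_denom = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denom - logits[i]
    return total / len(audio)


# Toy batch: two visual embeddings and two candidate sets of audio embeddings.
visual = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # each audio is close to its own visual
swapped = [[0.1, 0.9], [0.9, 0.1]]   # each audio is close to the wrong visual

loss_aligned = info_nce(aligned, visual)
loss_swapped = info_nce(swapped, visual)
```

The loss is small when each audio embedding lies closest to its own paired visual embedding and large when the pairs are swapped, which is the behavior an alignment objective needs. A real system would compute this on encoder outputs with an automatic-differentiation framework rather than on Python lists.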

Code & Models

Repositories

postech-ami/Sound2Scene
jax · Official

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Video Analysis and Summarization