I Hear Your True Colors: Image Guided Audio Generation

Roy Sheffer; Yossi Adi

arXiv:2211.03089·cs.SD·February 28, 2023·5 cites

TL;DR

Im2Wav is an image-guided audio generation system that uses hierarchical Transformer language models and CLIP embeddings to produce semantically relevant, high-fidelity sounds from images, outperforming the evaluated baselines in both fidelity and relevance.

Contribution

The paper introduces Im2Wav, which combines hierarchical audio modeling with CLIP-based visual conditioning, and provides a new benchmark dataset for image-to-audio tasks.

Findings

01 Outperforms baseline models in fidelity and relevance

02 Uses CLIP embeddings for effective visual conditioning

03 Provides an ablation study and a new evaluation dataset

Abstract

We propose Im2Wav, an image-guided open-domain audio generation system. Given an input image or a sequence of images, Im2Wav generates a semantically relevant sound. Im2Wav is based on two Transformer language models that operate over a hierarchical discrete audio representation obtained from a VQ-VAE-based model. We first produce a low-level audio representation using a language model. Then, we upsample the audio tokens using an additional language model to generate a high-fidelity audio sample. We use the rich semantics of a pre-trained CLIP (Contrastive Language-Image Pre-training) embedding as a visual representation to condition the language model. In addition, to steer the generation process towards the conditioning image, we apply the classifier-free guidance method. Results suggest that Im2Wav significantly outperforms the evaluated baselines in both fidelity and relevance…
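
The abstract mentions classifier-free guidance but does not spell out the formula. As a rough illustration only, the sketch below uses one commonly seen form of the technique, in which next-token logits from an image-conditioned pass and an unconditioned pass are blended with a guidance scale. The function names, the toy logits, and the scale values are all hypothetical and are not taken from the paper or from the RoySheffer/im2wav repository.

```python
import math


def guided_logits(cond, uncond, scale):
    """Blend logits as uncond + scale * (cond - uncond).

    This is one common form of classifier-free guidance, assumed here
    for illustration; the paper's exact formulation may differ.
    """
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits over a 4-entry audio token codebook.
cond = [2.0, 0.5, 0.1, -1.0]    # hypothetical pass conditioned on the image embedding
uncond = [1.0, 1.0, 0.0, 0.0]   # hypothetical pass with the conditioning dropped

plain = softmax(guided_logits(cond, uncond, 1.0))    # scale 1 recovers the conditional logits
steered = softmax(guided_logits(cond, uncond, 3.0))  # larger scale pushes harder toward the image

print("scale 1.0:", [round(p, 3) for p in plain])
print("scale 3.0:", [round(p, 3) for p in steered])
```

In this toy setup, raising the scale above 1 shifts probability mass toward the tokens that the image-conditioned pass prefers over the unconditioned one, which is the sense in which guidance steers generation towards the conditioning image.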

Code & Models

Repositories

RoySheffer/im2wav (PyTorch, official)

Taxonomy

Topics: Music and Audio Processing · Generative Adversarial Networks and Image Synthesis