I Hear Your True Colors: Image Guided Audio Generation
Roy Sheffer, Yossi Adi

TL;DR
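Im2Wav is an image-guided, open-domain audio generation system. It runs two Transformer language models over a hierarchical discrete audio representation, conditions them on CLIP image embeddings, and generates semantically relevant, high-fidelity sound that outperforms the evaluated baselines in both fidelity and relevance.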
Contribution
The paper introduces Im2Wav, combining hierarchical audio modeling with CLIP-based visual conditioning and a new benchmark dataset for image-to-audio tasks.
Findings
Outperforms baseline models in fidelity and relevance
Uses CLIP embeddings for effective visual conditioning
Provides an ablation study and a new evaluation dataset
Abstract
We propose Im2Wav, an image-guided open-domain audio generation system. Given an input image or a sequence of images, Im2Wav generates a semantically relevant sound. Im2Wav is based on two Transformer language models that operate over a hierarchical discrete audio representation obtained from a VQ-VAE-based model. We first produce a low-level audio representation using a language model. Then, we upsample the audio tokens using an additional language model to generate a high-fidelity audio sample. We use the rich semantics of a pre-trained CLIP (Contrastive Language-Image Pre-training) embedding as a visual representation to condition the language model. In addition, to steer the generation process towards the conditioning image, we apply the classifier-free guidance method. Results suggest that Im2Wav significantly outperforms the evaluated baselines in both fidelity and relevance…
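The classifier-free guidance step mentioned in the abstract can be illustrated with a minimal sketch. It assumes the commonly used formulation in which the next-token logits from an image-conditioned pass and an unconditioned pass are blended linearly; the exact form used in Im2Wav may differ. The function names, the guidance scale, and the toy logit values below are hypothetical and chosen only for illustration.

```python
import math


def guided_logits(cond_logits, uncond_logits, scale):
    """Blend conditional and unconditional next-token logits.

    scale = 0 returns the unconditional logits, scale = 1 returns the
    conditional logits, and scale > 1 pushes further towards the condition.
    This linear blend is an assumed, commonly used formulation.
    """
    if len(cond_logits) != len(uncond_logits):
        raise ValueError("logit vectors must have the same length")
    return [u + scale * (c - u) for c, u in zip(cond_logits, uncond_logits)]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy logits over a 3-token audio vocabulary (hypothetical values).
cond = [2.0, 0.5, -1.0]    # pass conditioned on the image embedding
uncond = [1.0, 1.0, 0.0]   # pass with the conditioning dropped

probs_plain = softmax(guided_logits(cond, uncond, 1.0))
probs_guided = softmax(guided_logits(cond, uncond, 3.0))
```

In this toy case, raising the scale above 1 concentrates probability on the token that the conditioned pass already favours, which is the intended effect of steering generation towards the conditioning image.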
Taxonomy
Topics: Music and Audio Processing · Generative Adversarial Networks and Image Synthesis
