Align, Adapt and Inject: Sound-guided Unified Image Generation

Yue Yang; Kaipeng Zhang; Yuying Ge; Wenqi Shao; Zeyue Xue; Yu Qiao; Ping Luo

arXiv:2306.11504 · cs.GR · June 21, 2023 · 2 cites

Open Access

TL;DR

This paper introduces a unified framework that leverages sound to guide image generation, editing, and stylization by aligning audio with textual and visual representations, enhancing the capabilities of diffusion models.

Contribution

The proposed AAI framework effectively adapts sound into a token compatible with existing diffusion-based T2I models, enabling sound-guided image tasks with improved performance.

Findings

01. Outperforms state-of-the-art sound-guided image generation methods.

02. Achieves competitive results in audio-visual and audio-text retrieval tasks.

03. Enables flexible sound-guided image editing and stylization.

Abstract

Text-guided image generation has witnessed unprecedented progress due to the development of diffusion models. Beyond text and image, sound is a vital element within the sphere of human perception, offering vivid representations and naturally coinciding with corresponding scenes. Taking advantage of sound therefore presents a promising avenue for exploration within image generation research. However, the relationship between audio and image supervision remains significantly underdeveloped, and the scarcity of related, high-quality datasets brings further obstacles. In this paper, we propose a unified framework 'Align, Adapt, and Inject' (AAI) for sound-guided image generation, editing, and stylization. In particular, our method adapts input sound into a sound token, like an ordinary word, which can plug and play with existing powerful diffusion-based Text-to-Image (T2I) models.…
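The abstract describes adapting input sound into a sound token that behaves like an ordinary word in a T2I prompt. The sketch below is only a toy illustration of that idea in plain Python, not the paper's implementation: the single linear projection, the `<sound>` placeholder, the function names `linear_adapter` and `inject_sound_token`, and all dimensions and random values are assumptions made for the example. The abstract does not specify the adapter architecture or how the token is injected.

```python
import random

def linear_adapter(audio_embedding, weight, bias):
    """Project an audio embedding into the text-token embedding space.

    Hypothetical stand-in for the 'Adapt' step: one linear layer.
    """
    return [
        sum(w * x for w, x in zip(row, audio_embedding)) + b
        for row, b in zip(weight, bias)
    ]

def inject_sound_token(prompt_embeddings, placeholder_index, sound_token):
    """Return a copy of the prompt embeddings with one position replaced
    by the sound token (hypothetical stand-in for the 'Inject' step)."""
    injected = [list(vec) for vec in prompt_embeddings]
    injected[placeholder_index] = list(sound_token)
    return injected

random.seed(0)
AUDIO_DIM, TOKEN_DIM = 6, 4  # toy sizes chosen for illustration only

# Stand-ins for an audio encoder output and untrained adapter weights.
audio_embedding = [random.gauss(0, 1) for _ in range(AUDIO_DIM)]
weight = [[random.gauss(0, 0.1) for _ in range(AUDIO_DIM)]
          for _ in range(TOKEN_DIM)]
bias = [0.0] * TOKEN_DIM

# Stand-in for the text encoder's per-token embeddings of a prompt.
prompt = ["a", "photo", "of", "<sound>"]
prompt_embeddings = [[random.gauss(0, 1) for _ in range(TOKEN_DIM)]
                     for _ in prompt]

sound_token = linear_adapter(audio_embedding, weight, bias)
conditioned = inject_sound_token(
    prompt_embeddings, prompt.index("<sound>"), sound_token
)
```

In this sketch the conditioned sequence has the same length and width as the original prompt embeddings, which is what would let a frozen text-conditioned generator consume it unchanged; only the placeholder position differs.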

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Music and Audio Processing · Multimodal Machine Learning Applications

Methods: Diffusion · Adapter · ALIGN