Align, Adapt and Inject: Sound-guided Unified Image Generation
Yue Yang, Kaipeng Zhang, Yuying Ge, Wenqi Shao, Zeyue Xue, Yu Qiao, Ping Luo

TL;DR
This paper introduces a unified framework that leverages sound to guide image generation, editing, and stylization by aligning audio with textual and visual representations, enhancing the capabilities of diffusion models.
Contribution
The proposed AAI framework effectively adapts sound into a token compatible with existing diffusion-based T2I models, enabling sound-guided image tasks with improved performance.
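To make the "sound as a token" idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes a hypothetical linear adapter that projects an audio embedding into the text-token embedding space, and a hypothetical placeholder position in the prompt where the resulting sound token is injected. The function names, dimensions, and random values are all illustrative assumptions; a real system would take the audio embedding from a pretrained audio encoder and the prompt embeddings from the T2I model's text encoder.

```python
import random


def adapt_sound_to_token(audio_embedding, weight, bias):
    """Hypothetical linear adapter: map an audio embedding (length A)
    to a token embedding (length T) using a T x A weight matrix."""
    return [
        sum(w * a for w, a in zip(row, audio_embedding)) + b
        for row, b in zip(weight, bias)
    ]


def inject_sound_token(prompt_embeddings, sound_token, position):
    """Return a copy of the prompt embeddings with the placeholder
    at `position` replaced by the sound token."""
    if not 0 <= position < len(prompt_embeddings):
        raise IndexError("placeholder position out of range")
    injected = [list(vec) for vec in prompt_embeddings]
    injected[position] = list(sound_token)
    return injected


# Illustrative sizes only (assumptions, not values from the paper).
AUDIO_DIM, TOKEN_DIM, PROMPT_LEN = 8, 4, 5
rng = random.Random(0)

audio_embedding = [rng.gauss(0, 1) for _ in range(AUDIO_DIM)]
weight = [[rng.gauss(0, 0.1) for _ in range(AUDIO_DIM)] for _ in range(TOKEN_DIM)]
bias = [0.0] * TOKEN_DIM
prompt_embeddings = [
    [rng.gauss(0, 1) for _ in range(TOKEN_DIM)] for _ in range(PROMPT_LEN)
]

sound_token = adapt_sound_to_token(audio_embedding, weight, bias)
conditioned = inject_sound_token(prompt_embeddings, sound_token, position=2)
```

Because the sound token has the same dimensionality as an ordinary word embedding, the conditioned sequence can in principle be passed to an unmodified text-conditioned diffusion model, which is what makes the approach plug and play.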
Findings
Outperforms state-of-the-art sound-guided image generation methods.
Achieves competitive results in audio-visual and audio-text retrieval tasks.
Enables flexible sound-guided image editing and stylization.
Abstract
Text-guided image generation has witnessed unprecedented progress due to the development of diffusion models. Beyond text and image, sound is a vital element within the sphere of human perception, offering vivid representations and naturally coinciding with corresponding scenes. Taking advantage of sound therefore presents a promising avenue for exploration within image generation research. However, the relationship between audio and image supervision remains significantly underdeveloped, and the scarcity of related, high-quality datasets brings further obstacles. In this paper, we propose a unified framework 'Align, Adapt, and Inject' (AAI) for sound-guided image generation, editing, and stylization. In particular, our method adapts input sound into a sound token, like an ordinary word, which can plug and play with existing powerful diffusion-based Text-to-Image (T2I) models.…
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Music and Audio Processing · Multimodal Machine Learning Applications
Methods: Diffusion · Adapter · ALIGN
