Sound-Guided Semantic Image Manipulation
Seung Hyun Lee, Wonseok Roh, Wonmin Byeon, Sang Ho Yoon, Chan Young Kim, Jinkyu Kim, Sangpil Kim

TL;DR
This paper introduces a novel framework that encodes sound into a multi-modal embedding space to enable dynamic, emotion-rich image manipulation guided by audio, expanding beyond traditional text-based methods.
Contribution
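It proposes a sound encoder whose outputs are aligned with an image-text embedding space, enabling sound-guided image manipulation and mixing of text and audio guidance. The authors report outperforming state-of-the-art methods on zero-shot audio and image classification.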
Findings
Effective sound-guided image manipulation demonstrated
Ability to mix text and audio modalities for richer modifications
Outperforms state-of-the-art in zero-shot audio and image classification
Abstract
The recent success of the generative model shows that leveraging the multi-modal embedding space can manipulate an image using text information. However, manipulating an image with other sources rather than text, such as sound, is not easy due to the dynamic characteristics of the sources. Especially, sound can convey vivid emotions and dynamic expressions of the real world. Here, we propose a framework that directly encodes sound into the multi-modal (image-text) embedding space and manipulates an image from the space. Our audio encoder is trained to produce a latent representation from an audio input, which is forced to be aligned with image and text representations in the multi-modal embedding space. We use a direct latent optimization method based on aligned embeddings for sound-guided image manipulation. We also show that our method can mix text and audio modalities, which enrich…
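The direct latent optimization mentioned in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the generator and image encoder are replaced by a hypothetical fixed linear map (`proj`), the audio embedding is a hand-picked vector, and the loss (cosine distance to the audio embedding plus a penalty `lam` on drifting from the source latent) and the finite-difference gradients are assumptions chosen to keep the example dependency-free.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    # Cosine similarity; the small constant guards against division by zero.
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)) + 1e-12)

def embed(latent, proj):
    # Hypothetical stand-in for "generate an image, then encode it":
    # a fixed linear map from latent space to the shared embedding space.
    return [dot(row, latent) for row in proj]

def manipulation_loss(latent, source, audio_emb, proj, lam):
    # Assumed objective: move the embedding toward the audio embedding
    # while keeping the latent close to the source latent.
    drift = sum((l - s) ** 2 for l, s in zip(latent, source))
    return 1.0 - cosine(embed(latent, proj), audio_emb) + lam * drift

def optimize_latent(source, audio_emb, proj, lam=0.05, lr=0.05, steps=300, eps=1e-5):
    # Plain gradient descent with central finite-difference gradients.
    latent = list(source)
    for _ in range(steps):
        grad = []
        for i in range(len(latent)):
            up = list(latent)
            up[i] += eps
            down = list(latent)
            down[i] -= eps
            diff = (manipulation_loss(up, source, audio_emb, proj, lam)
                    - manipulation_loss(down, source, audio_emb, proj, lam))
            grad.append(diff / (2 * eps))
        latent = [l - lr * g for l, g in zip(latent, grad)]
    return latent

# Toy inputs (all values are illustrative, not from the paper).
source = [1.0, 0.0]                # latent code of the image to edit
audio_emb = [0.5, 0.866]           # pretend output of an audio encoder
proj = [[1.0, 0.2], [0.0, 1.0]]    # stand-in generator + image encoder
edited = optimize_latent(source, audio_emb, proj)
print("similarity before:", cosine(embed(source, proj), audio_emb))
print("similarity after: ", cosine(embed(edited, proj), audio_emb))
```

In a real pipeline the linear map would be replaced by a pretrained generator followed by an image encoder, and gradients would come from automatic differentiation. The penalty weight `lam` controls how far the edited latent may move from the original image.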
Taxonomy
Topics: Digital Media Forensic Detection · Music and Audio Processing · Generative Adversarial Networks and Image Synthesis
