Sound-Guided Semantic Image Manipulation

Seung Hyun Lee, Wonseok Roh, Wonmin Byeon, Sang Ho Yoon, Chan Young Kim, Jinkyu Kim, Sangpil Kim

arXiv:2112.00007·cs.GR·December 2, 2021

TL;DR

This paper introduces a novel framework that encodes sound into a multi-modal embedding space to enable dynamic, emotion-rich image manipulation guided by audio, expanding beyond traditional text-based methods.

Contribution

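The paper proposes a sound encoder whose output is aligned with an existing image-text embedding space. This alignment enables sound-guided image manipulation and mixing of text and audio guidance, and the method is reported to outperform state-of-the-art approaches in zero-shot audio and image classification.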
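As a rough illustration of what aligning audio embeddings with an image-text space can look like, the sketch below computes a symmetric contrastive (InfoNCE-style) loss between paired audio and text embeddings. The choice of loss, the temperature value, and all function names are assumptions made for illustration; the summary above does not state which objective the authors use.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def similarity_logits(audio_embs, text_embs, temperature=0.07):
    """Pairwise cosine similarities divided by a temperature (assumed value)."""
    audio = [l2_normalize(v) for v in audio_embs]
    text = [l2_normalize(v) for v in text_embs]
    return [
        [sum(a * t for a, t in zip(a_vec, t_vec)) / temperature for t_vec in text]
        for a_vec in audio
    ]


def diagonal_cross_entropy(logits):
    """Mean cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_z = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(audio_embs, text_embs, temperature=0.07):
    """Average of the audio-to-text and text-to-audio contrastive losses."""
    logits = similarity_logits(audio_embs, text_embs, temperature)
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(transposed))


# Toy data: audio clip i is paired with caption i.
audio_batch = [[1.0, 0.0], [0.0, 1.0]]
matched_text = [[0.9, 0.1], [0.1, 0.9]]
mismatched_text = [[0.1, 0.9], [0.9, 0.1]]

loss_matched = symmetric_contrastive_loss(audio_batch, matched_text)
loss_mismatched = symmetric_contrastive_loss(audio_batch, mismatched_text)
```

With this toy data, the loss is lower when each audio embedding points in the same direction as its paired text embedding. Minimizing such a loss over an audio encoder's parameters would pull its outputs toward the matching text (or image) embeddings.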

Findings

01 Effective sound-guided image manipulation demonstrated

02 Ability to mix text and audio modalities for richer modifications

03 Outperforms state-of-the-art in zero-shot audio and image classification

Abstract

The recent success of the generative model shows that leveraging the multi-modal embedding space can manipulate an image using text information. However, manipulating an image with other sources rather than text, such as sound, is not easy due to the dynamic characteristics of the sources. Especially, sound can convey vivid emotions and dynamic expressions of the real world. Here, we propose a framework that directly encodes sound into the multi-modal (image-text) embedding space and manipulates an image from the space. Our audio encoder is trained to produce a latent representation from an audio input, which is forced to be aligned with image and text representations in the multi-modal embedding space. We use a direct latent optimization method based on aligned embeddings for sound-guided image manipulation. We also show that our method can mix text and audio modalities, which enrich…
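The direct latent optimization step mentioned in the abstract can be sketched as gradient descent on a latent code, maximizing cosine similarity between the embedding of the generated image and the audio embedding while staying close to the source latent. Everything concrete here is an assumption for illustration: the fixed linear map `ENCODER` is a hypothetical stand-in for a differentiable generator followed by an image encoder, and the regularization weight, learning rate, and step count are arbitrary.

```python
import math


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def matvec(matrix, vec):
    return [dot(row, vec) for row in matrix]


def mat_t_vec(matrix, vec):
    """Multiply the transpose of `matrix` by `vec`."""
    cols = len(matrix[0])
    return [sum(matrix[i][j] * vec[i] for i in range(len(matrix))) for j in range(cols)]


def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def loss_and_grad(latent, source_latent, target_emb, encoder, reg_weight):
    """Loss = (1 - cos(encoder @ latent, target)) + reg_weight * ||latent - source||^2."""
    emb = matvec(encoder, latent)
    emb_norm = math.sqrt(dot(emb, emb))
    target_norm = math.sqrt(dot(target_emb, target_emb))
    cos = dot(emb, target_emb) / (emb_norm * target_norm)
    # Analytic gradient of cosine similarity with respect to the embedding.
    dcos_demb = [
        t / (emb_norm * target_norm) - cos * e / (emb_norm ** 2)
        for e, t in zip(emb, target_emb)
    ]
    dcos_dlatent = mat_t_vec(encoder, dcos_demb)
    diff = [w - s for w, s in zip(latent, source_latent)]
    loss = (1.0 - cos) + reg_weight * dot(diff, diff)
    grad = [-g + 2.0 * reg_weight * d for g, d in zip(dcos_dlatent, diff)]
    return loss, grad


def optimize_latent(source_latent, target_emb, encoder, reg_weight=0.1, lr=0.1, steps=300):
    """Plain gradient descent starting from the source latent."""
    latent = list(source_latent)
    for _ in range(steps):
        _, grad = loss_and_grad(latent, source_latent, target_emb, encoder, reg_weight)
        latent = [w - lr * g for w, g in zip(latent, grad)]
    return latent


# Hypothetical stand-ins: a 2x2 "generator + image encoder" and a toy audio embedding.
ENCODER = [[1.0, 0.5], [0.0, 1.0]]
SOURCE_LATENT = [1.0, 0.0]
AUDIO_EMBEDDING = [0.0, 1.0]

edited_latent = optimize_latent(SOURCE_LATENT, AUDIO_EMBEDDING, ENCODER)
cos_before = cosine(matvec(ENCODER, SOURCE_LATENT), AUDIO_EMBEDDING)
cos_after = cosine(matvec(ENCODER, edited_latent), AUDIO_EMBEDDING)
```

The regularization term reflects a common design choice in latent editing: without it, the optimizer may drift arbitrarily far from the source image. Under the same assumptions, mixing modalities could be approximated by replacing the target with a weighted combination of a text embedding and an audio embedding.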

Code & Models

Repositories

kuai-lab/sound-guided-semantic-image-manipulation (PyTorch, official)

Taxonomy

Topics: Digital Media Forensic Detection · Music and Audio Processing · Generative Adversarial Networks and Image Synthesis