Sound-Guided Semantic Video Generation

Seung Hyun Lee; Gyeongrok Oh; Wonmin Byeon; Chanyoung Kim; Won Jeong Ryoo; Sang Ho Yoon; Hyunjun Cho; Jihyun Bae; Jinkyu Kim; Sangpil Kim

arXiv:2204.09273 · cs.CV · October 24, 2022 · 1 citation

TL;DR

This paper introduces a sound-guided framework for semantic video generation that uses a multimodal (sound-image-text) embedding space and a sound inversion module to generate realistic videos that are semantically consistent with the input sound.

Contribution

The method combines a sound inversion module, which maps audio directly into the StyleGAN latent space, with a CLIP-based multimodal embedding space, so that the generated video stays aligned with the audio.

Findings

01 Outperforms state-of-the-art in video quality

02 Provides a new high-resolution landscape video dataset

03 Demonstrates effective applications in editing

Abstract

The recent success of StyleGAN demonstrates that the pre-trained StyleGAN latent space is useful for realistic video generation. However, the generated motion in the video is usually not semantically meaningful due to the difficulty of determining the direction and magnitude in the StyleGAN latent space. In this paper, we propose a framework to generate realistic videos by leveraging a multimodal (sound-image-text) embedding space. As sound provides the temporal context of the scene, our framework learns to generate a video that is semantically consistent with the sound. First, our sound inversion module maps the audio directly into the StyleGAN latent space. We then incorporate the CLIP-based multimodal embedding space to further provide the audio-visual relationships. Finally, the proposed frame generator learns to find the trajectory in the latent space which is coherent with the…
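The idea of walking through a latent space under the guidance of sound can be illustrated with a small toy sketch. This is not the authors' implementation: the fixed linear map stands in for learned networks, the 4-dimensional vectors stand in for StyleGAN latents and audio features, and every name here (`latent_trajectory`, `step_weights`, `step_size`) is hypothetical. The cosine similarity function only hints at how embeddings from different modalities might be compared in a shared space.

```python
import math
import random


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def linear(weights, vec):
    """Apply a matrix (list of rows) to a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def latent_trajectory(w_start, audio_frames, step_weights, step_size=0.1):
    """Toy trajectory: each audio frame nudges the current latent code.

    Assumption: the per-frame step is a fixed linear function of the audio
    feature. In a real system this would be a learned, recurrent model.
    """
    w = list(w_start)
    trajectory = [list(w)]
    for frame in audio_frames:
        delta = linear(step_weights, frame)
        w = [wi + step_size * di for wi, di in zip(w, delta)]
        trajectory.append(list(w))
    return trajectory


# Hypothetical toy data: 4-dimensional latents and audio features.
rng = random.Random(0)
dim = 4
step_weights = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)]
w_start = [rng.uniform(-1, 1) for _ in range(dim)]
audio_frames = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(5)]

trajectory = latent_trajectory(w_start, audio_frames, step_weights)
# Each entry of `trajectory` would be decoded into one video frame by a generator.
```

In this sketch, a silent input (all-zero audio features) leaves the latent code unchanged, which mirrors the intuition that the sound drives the motion in the generated video.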

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Music and Audio Processing · Advanced Vision and Imaging

Methods: StyleGAN · Adaptive Instance Normalization · Dense Connections · Convolution · R1 Regularization · Feedforward Network