Listen and Move: Improving GANs Coherency in Agnostic Sound-to-Video Generation

Rafael Redondo

arXiv:2406.16155·cs.SD·June 25, 2024

Open Access

TL;DR

This paper introduces novel techniques to improve the quality and temporal consistency of sound-to-video generative adversarial networks, addressing the challenge of smooth video dynamics in audiovisual synthesis.

Contribution

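It proposes three features for sound-to-video GANs: a triple sound routing scheme, a multi-scale residual and dilated recurrent network for extended sound analysis, and a recurrent and directional convolutional layer for video prediction. Each one improves image quality and temporal coherency over the baseline architecture.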

Findings

01 Each proposed feature improves video quality and coherency

02 The video prediction layer provides an extra temporal refinement

03 The baseline architecture typically used in the state of the art is improved in both quality and coherency

Abstract

Deep generative models have demonstrated the ability to create realistic audiovisual content, sometimes driven by domains of a different nature. However, achieving smooth temporal dynamics in video generation remains a challenging problem. This work focuses on generic sound-to-video generation and proposes three main features to enhance both image quality and temporal coherency in generative adversarial models: a triple sound routing scheme, a multi-scale residual and dilated recurrent network for extended sound analysis, and a novel recurrent and directional convolutional layer for video prediction. Each of the proposed features improves, in both quality and coherency, the baseline neural architecture typically used in the state of the art (SoTA), with the video prediction layer providing an extra temporal refinement.
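
As a generic illustration of one idea named in the abstract, the sketch below shows what a dilated recurrence with residual connections across several scales can look like. It is not the paper's architecture. The function names, the scalar features, the fixed weights `w_in` and `w_rec`, and the dilations (1, 2, 4) are all assumptions chosen for brevity.

```python
import math


def dilated_rnn_layer(xs, dilation, w_in=0.5, w_rec=0.5):
    """Hypothetical recurrent layer: the state at step t depends on the
    input at t and on the state at step t - dilation (zero before that)."""
    hs = []
    for t, x in enumerate(xs):
        prev = hs[t - dilation] if t >= dilation else 0.0
        hs.append(math.tanh(w_in * x + w_rec * prev))
    return hs


def multi_scale_features(xs, dilations=(1, 2, 4)):
    """Stack dilated layers with residual (skip) connections and return
    the sequence produced at each scale. Illustrative assumption only."""
    outputs = []
    current = list(xs)
    for d in dilations:
        hs = dilated_rnn_layer(current, d)
        # Residual connection: add the layer output to its own input.
        current = [c + h for c, h in zip(current, hs)]
        outputs.append(current)
    return outputs


# Toy "sound" sequence: a single impulse followed by silence.
features = multi_scale_features([0.0, 1.0, 0.0, 0.0, 0.0])
```

Larger dilations let later layers relate time steps that are further apart without lengthening the recurrence chain, which is the usual motivation for analysing a signal at several temporal scales.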

Taxonomy

Topics: Music and Audio Processing · Music Technology and Sound Studies · Speech and Audio Processing