Synthesizing Audio from Silent Video using Sequence to Sequence Modeling

Hugo Garrido-Lestache Belinchon; Helina Mulugeta; Adam Haile

arXiv:2404.17608 · cs.SD · April 30, 2024

TL;DR

This paper introduces a sequence-to-sequence model utilizing VQ-VAE to generate diverse and realistic audio from silent videos, addressing previous limitations in sound diversity and generalization.

Contribution

It presents a novel approach combining VQ-VAE with a custom audio decoder for improved audio synthesis from silent videos.

Findings

01 Enhanced sound diversity in generated audio

02 Better generalization across different video domains

03 Improved performance over prior CNN and WaveNet methods

Abstract

Generating audio from a video's visual context has several practical applications in how we interact with audio-visual media, for example enhancing CCTV footage analysis, restoring historical videos (e.g., silent movies), and improving video generation models. We propose a novel method to generate audio from video using a sequence-to-sequence model, improving on prior work that used CNNs and WaveNet and struggled with sound diversity and generalization. Our approach employs a 3D Vector Quantized Variational Autoencoder (VQ-VAE) to capture the video's spatial and temporal structure, and decodes its output with a custom audio decoder to produce a broader range of sounds. The model is trained on a segment of the YouTube-8M dataset, focusing on specific domains, and targets the applications listed above.
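The abstract does not spell out how the VQ-VAE discretizes video features, so the following is only a minimal sketch of the generic vector-quantization step that VQ-VAEs share: each continuous latent vector is replaced by its nearest entry in a learned codebook. The function names, the toy codebook, and the toy latents are all hypothetical and are not taken from the paper or its repository; a real model would use a framework such as PyTorch and much higher-dimensional latents.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(
    latents: Sequence[Vector], codebook: Sequence[Vector]
) -> Tuple[List[int], List[Vector]]:
    """Map each latent vector to its nearest codebook entry.

    Returns the chosen codebook indices (the discrete token sequence)
    and the corresponding quantized vectors.
    """
    indices: List[int] = []
    for z in latents:
        # Pick the codebook index with the smallest distance to z.
        nearest = min(
            range(len(codebook)),
            key=lambda k: squared_distance(z, codebook[k]),
        )
        indices.append(nearest)
    quantized = [codebook[i] for i in indices]
    return indices, quantized


# Hypothetical toy values: a 3-entry codebook of 2-D vectors and
# four encoder outputs standing in for video latents.
codebook = [(0.0, 0.0), (1.0, 1.0), (-1.0, 2.0)]
latents = [(0.1, -0.2), (0.9, 1.2), (-0.8, 1.7), (0.6, 0.6)]

indices, quantized = quantize(latents, codebook)
print(indices)  # [0, 1, 2, 1]
```

In a sequence-to-sequence setup of the kind the abstract describes, a sequence of indices like this would plausibly serve as the discrete representation handed to the decoder, though the paper's exact interface between the video encoder and the audio decoder is not stated here.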

Code & Models

Repositories

Adam-Haile/vita-research-group (PyTorch, official)

Taxonomy

Topics: Video Analysis and Summarization · Speech and Audio Processing · Music and Audio Processing

Methods: Mixture of Logistic Distributions · Dilated Causal Convolution · WaveNet