Synthesizing Audio from Silent Video using Sequence to Sequence Modeling
Hugo Garrido-Lestache Belinchon, Helina Mulugeta, Adam Haile

TL;DR
This paper introduces a sequence-to-sequence model that uses a 3D VQ-VAE to generate diverse, realistic audio from silent video, targeting the limited sound diversity and weak generalization of earlier methods.
Contribution
The paper combines a 3D VQ-VAE, which encodes the video's spatial and temporal structure, with a custom audio decoder to synthesize audio from silent video.
Findings
Enhanced sound diversity in generated audio
Better generalization across different video domains
Improved performance over prior CNN and WaveNet methods
Abstract
Generating audio from a video's visual context has multiple practical applications in improving how we interact with audio-visual media, such as enhancing CCTV footage analysis, restoring historical videos (e.g., silent movies), and improving video generation models. We propose a novel method to generate audio from video using a sequence-to-sequence model, improving on prior work that used CNNs and WaveNet and faced sound diversity and generalization challenges. Our approach employs a 3D Vector Quantized Variational Autoencoder (VQ-VAE) to capture the video's spatial and temporal structures, decoding with a custom audio decoder for a broader range of sounds. Trained on a segment of the YouTube-8M dataset focused on specific domains, our model aims to enhance applications like CCTV footage analysis, silent movie restoration, and video generation models.
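To make the vector quantization step of a 3D VQ-VAE concrete, here is a minimal NumPy sketch of the nearest-codebook lookup applied to a grid of latents spanning time and space. It is an illustration under stated assumptions, not the authors' implementation: the function name `quantize`, the codebook size (512), the latent dimension (64), and the grid shape are all hypothetical, and training details such as the straight-through gradient and commitment loss are omitted.

```python
import numpy as np

def quantize(latents, codebook):
    """Map each latent vector to its nearest codebook entry (squared L2 distance).

    latents:  (T, H, W, D) encoder outputs over time and space (assumed layout)
    codebook: (K, D) embedding vectors
    Returns (indices with shape (T, H, W), quantized latents with shape (T, H, W, D)).
    """
    flat = latents.reshape(-1, latents.shape[-1])  # (N, D)
    # Squared distances via ||a||^2 - 2 a.b + ||b||^2, giving an (N, K) matrix
    d2 = (
        (flat ** 2).sum(axis=1, keepdims=True)
        - 2.0 * flat @ codebook.T
        + (codebook ** 2).sum(axis=1)
    )
    idx = d2.argmin(axis=1)                        # nearest code per latent vector
    quantized = codebook[idx].reshape(latents.shape)
    return idx.reshape(latents.shape[:-1]), quantized

# Hypothetical sizes chosen only for illustration
rng = np.random.default_rng(0)
codebook = rng.normal(size=(512, 64))
latents = rng.normal(size=(4, 8, 8, 64))           # 4 time steps, 8x8 spatial grid
indices, quantized = quantize(latents, codebook)
```

The resulting grid of discrete indices is the kind of token sequence a sequence-to-sequence audio decoder could consume; how the paper's decoder actually conditions on these codes is not specified in this summary.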
Taxonomy
Topics: Video Analysis and Summarization · Speech and Audio Processing · Music and Audio Processing
Methods: Mixture of Logistic Distributions · Dilated Causal Convolution · WaveNet
