Sound2Sight: Generating Visual Dynamics from Sound and Context
Anoop Cherian, Moitreya Chatterjee, Narendra Ahuja

TL;DR
Sound2Sight introduces a deep variational framework that generates future video frames conditioned on audio and past frames, enabling diverse and coherent visual predictions in multimodal reasoning tasks.
Contribution
It presents a novel stochastic prior conditioned on audio-visual embeddings and a multimodal discriminator, improving video synthesis quality and diversity.
Findings
Outperforms prior methods in video quality and diversity
Effective in occlusion reasoning scenarios
Validated on multiple datasets including new ones
Abstract
Learning associations across modalities is critical for robust multimodal reasoning, especially when a modality may be missing during inference. In this paper, we study this problem in the context of audio-conditioned visual synthesis -- a task that is important, for example, in occlusion reasoning. Specifically, our goal is to generate future video frames and their motion dynamics conditioned on audio and a few past frames. To tackle this problem, we present Sound2Sight, a deep variational framework that is trained to learn a per-frame stochastic prior conditioned on a joint embedding of audio and past frames. This embedding is learned via a multi-head attention-based audio-visual transformer encoder. The learned prior is then sampled to further condition a video forecasting module to generate future frames. The stochastic prior allows the model to sample multiple plausible futures…
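The sampling step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: a single-head scaled dot-product attention stands in for the multi-head audio-visual transformer encoder, the prior is assumed to be a diagonal Gaussian, and the names `attend`, `sample_prior`, `sample_futures`, `w_mu`, and `w_logvar` are hypothetical. The sketch only shows how one audio-visual context can yield several different latent samples, each of which would condition a frame generator.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(query, keys, values):
    # Single-head scaled dot-product attention (a stand-in for the
    # multi-head transformer encoder mentioned in the abstract).
    scores = query @ keys.T / np.sqrt(keys.shape[-1])
    return softmax(scores) @ values

def sample_prior(context, w_mu, w_logvar, rng):
    # Assumed diagonal-Gaussian prior: mean and log-variance are linear
    # functions of the context; sample with the reparameterization trick.
    mu = context @ w_mu
    logvar = context @ w_logvar
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def sample_futures(audio_emb, frame_emb, w_mu, w_logvar, n_samples, seed=0):
    # audio_emb: (T_audio, d), frame_emb: (T_frames, d)
    rng = np.random.default_rng(seed)
    # Joint context: the latest frame embedding attends over audio and
    # past-frame embeddings.
    memory = np.concatenate([audio_emb, frame_emb], axis=0)
    context = attend(frame_emb[-1:], memory, memory)[0]
    # Each draw is one latent code for one plausible next frame.
    return np.stack([sample_prior(context, w_mu, w_logvar, rng)
                     for _ in range(n_samples)])
```

Calling `sample_futures` with the same embeddings and a larger `n_samples` returns more latent codes from the same prior; in a full model each code would be passed, together with the past frames, to the forecasting module.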
Taxonomy
Topics: Music and Audio Processing · Video Analysis and Summarization · Generative Adversarial Networks and Image Synthesis
