Sound2Sight: Generating Visual Dynamics from Sound and Context

Anoop Cherian; Moitreya Chatterjee; Narendra Ahuja

arXiv:2007.12130 · cs.CV · July 24, 2020 · 1 cites

TL;DR

Sound2Sight introduces a deep variational framework that generates future video frames conditioned on audio and past frames, enabling diverse and coherent visual predictions in multimodal reasoning tasks.

Contribution

The paper introduces a stochastic prior conditioned on audio-visual embeddings, together with a multimodal discriminator; both are aimed at improving the quality and diversity of the synthesized video.

Findings

01 Outperforms prior methods in video quality and diversity

02 Effective in occlusion reasoning scenarios

03 Validated on multiple datasets including new ones

Abstract

Learning associations across modalities is critical for robust multimodal reasoning, especially when a modality may be missing during inference. In this paper, we study this problem in the context of audio-conditioned visual synthesis -- a task that is important, for example, in occlusion reasoning. Specifically, our goal is to generate future video frames and their motion dynamics conditioned on audio and a few past frames. To tackle this problem, we present Sound2Sight, a deep variational framework that is trained to learn a per-frame stochastic prior conditioned on a joint embedding of audio and past frames. This embedding is learned via a multi-head attention-based audio-visual transformer encoder. The learned prior is then sampled to further condition a video forecasting module to generate future frames. The stochastic prior allows the model to sample multiple plausible futures…
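The sampling step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes the per-frame prior is a diagonal Gaussian whose mean and log-variance come from a hypothetical linear head (`W_mu`, `W_lv`) applied to the joint audio-visual embedding, and it stops at the latent codes that would condition a frame generator. All function and parameter names are illustrative.

```python
import numpy as np

def prior_params(context, W_mu, W_lv):
    # Hypothetical linear head: joint audio-visual embedding -> Gaussian parameters.
    # The paper learns this mapping; the linear form here is an assumption.
    return context @ W_mu, context @ W_lv

def sample_prior(mu, log_var, rng):
    # Reparameterized draw z = mu + sigma * eps from a diagonal Gaussian.
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def sample_futures(context_seq, W_mu, W_lv, num_samples, seed=0):
    # context_seq: (T, d) array, one joint embedding per future frame.
    # Returns latents of shape (num_samples, T, latent_dim); each sample is
    # one candidate "future" that a frame generator could be conditioned on.
    rng = np.random.default_rng(seed)
    futures = []
    for _ in range(num_samples):
        zs = [sample_prior(*prior_params(c, W_mu, W_lv), rng) for c in context_seq]
        futures.append(np.stack(zs))
    return np.stack(futures)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    ctx = rng.standard_normal((5, 8))   # 5 future steps, 8-dim embedding (made-up sizes)
    W_mu = rng.standard_normal((8, 4))  # 4-dim latent (made-up size)
    W_lv = np.zeros((8, 4))             # unit variance for the demo
    print(sample_futures(ctx, W_mu, W_lv, num_samples=3).shape)
```

Drawing several latent sequences from the same context is what yields multiple plausible futures; with the variance driven toward zero, every sample collapses to the mean and the diversity disappears.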

Taxonomy

Topics: Music and Audio Processing · Video Analysis and Summarization · Generative Adversarial Networks and Image Synthesis