Motion and Context-Aware Audio-Visual Conditioned Video Prediction
Yating Xu, Conghui Hu, Gim Hee Lee

TL;DR
This paper introduces a novel approach for audio-visual conditioned video prediction that decouples motion and appearance modeling, utilizing motion estimation and context-aware refinement to improve long-term prediction accuracy.
Contribution
The method separates motion and appearance modeling, incorporating motion-conditioned affine transformations and context-aware refinement for enhanced long-term video prediction.
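As a rough illustration of what a motion-conditioned affine transformation can look like, the sketch below scales and shifts each channel of an appearance feature map using parameters computed from a motion vector. This is a generic feature-modulation pattern, not the authors' implementation: the function name `affine_modulate`, the linear projections `w_scale` and `w_shift`, and all shapes are assumptions made for illustration.

```python
import numpy as np

def affine_modulate(features, motion_code, w_scale, w_shift):
    """Apply a per-channel affine transform conditioned on a motion vector.

    features:    (C, H, W) appearance feature map
    motion_code: (D,) motion descriptor (hypothetical)
    w_scale:     (C, D) projection producing per-channel scales (hypothetical)
    w_shift:     (C, D) projection producing per-channel shifts (hypothetical)
    """
    # Offset the scale by 1 so that a zero motion code leaves features unchanged.
    scale = 1.0 + w_scale @ motion_code  # shape (C,)
    shift = w_shift @ motion_code        # shape (C,)
    # Broadcast the per-channel parameters over the spatial dimensions.
    return scale[:, None, None] * features + shift[:, None, None]
```

With a zero motion code the function returns the input features unchanged, which makes the conditioning easy to check in isolation.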
Findings
Achieves competitive results on benchmark datasets.
Effectively models long-term video sequences.
Improves prediction quality by decoupling motion and appearance.
Abstract
The existing state-of-the-art method for audio-visual conditioned video prediction uses the latent codes of the audio-visual frames from a multimodal stochastic network and a frame encoder to predict the next visual frame. However, directly inferring per-pixel intensities for the next visual frame is extremely challenging because of the high-dimensional image space. To this end, we decouple audio-visual conditioned video prediction into motion and appearance modeling. The multimodal motion estimation predicts future optical flow based on the audio-motion correlation. The visual branch recalls from the motion memory built from the audio features to enable better long-term prediction. We further propose context-aware refinement to address the loss of global appearance context under long-term continuous warping. The global appearance context is extracted by the context…
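To make the motion half of the decoupling concrete, the sketch below shows how a predicted optical flow field can be used to warp the current frame into an estimate of the next one. It is a minimal NumPy illustration of standard backward warping with bilinear interpolation, not the paper's code: the function name `warp_frame`, the flow convention (per-pixel `(dx, dy)` offsets into the source frame), and the border clamping are assumptions.

```python
import numpy as np

def warp_frame(frame, flow):
    """Backward-warp a frame with a dense flow field.

    frame: (H, W) or (H, W, C) array
    flow:  (H, W, 2) array; flow[y, x] = (dx, dy) means output pixel (x, y)
           is sampled from frame at (x + dx, y + dy). Assumed convention.
    """
    h, w = frame.shape[:2]
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Source sampling coordinates, clamped to the image border.
    sx = np.clip(xs + flow[..., 0], 0, w - 1)
    sy = np.clip(ys + flow[..., 1], 0, h - 1)
    # Integer corners surrounding each sampling location.
    x0 = np.floor(sx).astype(int)
    y0 = np.floor(sy).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    # Fractional offsets used as bilinear weights.
    wx = sx - x0
    wy = sy - y0
    if frame.ndim == 3:
        wx = wx[..., None]
        wy = wy[..., None]
    top = (1 - wx) * frame[y0, x0] + wx * frame[y0, x1]
    bottom = (1 - wx) * frame[y1, x0] + wx * frame[y1, x1]
    return (1 - wy) * top + wy * bottom
```

Applying such a warp repeatedly to its own output is one way to roll a prediction forward over many steps. Each pass resamples the previous estimate, so detail from the original frame can degrade over time, which is the kind of effect a refinement stage is meant to counter.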
Taxonomy
Topics: Advanced Image Processing Techniques · Image Enhancement Techniques · Image and Signal Denoising Methods
Methods: Balanced Selection
