High Fidelity Video Prediction with Large Stochastic Recurrent Neural Networks
Ruben Villegas, Arkanath Pathak, Harini Kannan, Dumitru Erhan, Quoc V. Le, Honglak Lee

TL;DR
This paper explores whether minimal inductive biases combined with large neural networks can effectively predict future video frames, achieving state-of-the-art results across diverse datasets without complex architectural assumptions.
Contribution
It presents the first large-scale empirical study of video prediction with minimal inductive biases and demonstrates that large stochastic recurrent neural networks can outperform specialized models.
Findings
Achieved state-of-the-art performance on three diverse datasets covering object interactions, human motion, and car driving.
Large models with minimal biases can effectively predict complex video dynamics.
Questioned the necessity of handcrafted inductive biases in video prediction.
Abstract
Predicting future video frames is extremely challenging, as there are many factors of variation that make up the dynamics of how frames change through time. Previously proposed solutions require complex inductive biases inside network architectures with highly specialized computation, including segmentation masks, optical flow, and foreground and background separation. In this work, we question if such handcrafted architectures are necessary and instead propose a different approach: finding minimal inductive bias for video prediction while maximizing network capacity. We investigate this question by performing the first large-scale empirical study and demonstrate state-of-the-art performance by learning large models on three different datasets: one for modeling object interactions, one for modeling human motion, and one for modeling car driving.
Taxonomy
Topics: Human Pose and Action Recognition · Video Surveillance and Tracking Methods · Advanced Vision and Imaging
