Video Generation with Learned Action Prior
Meenakshi Sarkar, Devansh Bhardwaj, Debasish Ghose

TL;DR
This paper introduces models that explicitly incorporate camera motion as part of the observed state to improve stochastic video generation, especially in moving camera scenarios, using variational inference and diffusion processes.
Contribution
It proposes three novel models that integrate action prior learning into video generation, addressing partial observability and complex dynamics in moving camera videos.
Findings
The proposed models outperform existing methods on the RoAM dataset.
Explicit action modeling improves video realism and diversity.
Multi-modal training enhances generation quality in partial observability scenarios.
Abstract
Stochastic video generation is particularly challenging when the camera is mounted on a moving platform, as camera motion interacts with observed image pixels, creating complex spatio-temporal dynamics and making the problem partially observable. Existing methods typically address this by focusing on raw pixel-level image reconstruction without explicitly modelling camera motion dynamics. We propose a solution by considering camera motion or action as part of the observed image state, modelling both image and action within a multi-modal learning framework. We introduce three models: Video Generation with Learning Action Prior (VG-LeAP) treats the image-action pair as an augmented state generated from a single latent stochastic process and uses variational inference to learn the image-action latent prior; Causal-LeAP, which establishes a causal relationship between action and the…
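The abstract describes VG-LeAP as treating the image-action pair as one augmented state and learning a latent prior with variational inference. The sketch below illustrates that general idea and is not the authors' implementation. It assumes diagonal Gaussian prior and posterior distributions, as is common in SVG-LP-style models. The helper names (`augment_state`, `gaussian_kl`, `reparameterize`) and the flat-list feature layout are hypothetical.

```python
import math


def augment_state(image_features, action):
    """Concatenate image features and an action vector into one state.

    Assumption: both inputs are flat sequences of floats; a real model
    would typically use encoder outputs rather than raw values.
    """
    return list(image_features) + list(action)


def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) between two diagonal Gaussians, summed over dimensions.

    q plays the role of the approximate posterior and p the learned
    prior; each is parameterised by a mean and a log-variance.
    """
    total = 0.0
    for mq, lq, mp, lp in zip(mu_q, logvar_q, mu_p, logvar_p):
        total += 0.5 * (lp - lq + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp) - 1.0)
    return total


def reparameterize(mu, logvar, eps):
    """Draw z = mu + sigma * eps, with eps supplied by the caller."""
    return [m + math.exp(0.5 * lv) * e for m, lv, e in zip(mu, logvar, eps)]
```

In a training loop of this kind, the KL term would usually be added to a reconstruction loss over the augmented state, so that the learned prior is pushed towards the posterior that has seen the current frame and action.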
Peer Reviews
Decision · Submitted to ICLR 2025
The problem of partially observable video prediction is quite interesting and has applications in autonomous vehicles that have an onboard camera (such as autonomous cars/taxis), drones (with an onboard camera), and robot manipulators (that have wrist-mounted cameras). Tackling video prediction under partially observable settings (where the acting agent is not visible on the camera) can benefit robot applications (for instance pedestrian intent detection and prediction could influence auton
Although sound, I find the contributions (incorporating actions) to be minimal additions to the existing frameworks. For instance, in VG-LeAP, the extended image-action state pair is used to condition the SVG-LP model instead of just the images, and the latent posterior is approximated with recurrent modules. Similarly, in Causal-LeAP, two stochastic posteriors are learned, one each for image and action, using recurrent modules. For RAFI, the image latent is concatenated with the action
* The paper adapts action-free VAE and flow-based formulations from the video prediction literature to action-conditioned learning.
* Empirical results show that incorporating action information during training improves video prediction accuracy.
* Missing related works. The paper revisits the literature on VAE-based and flow-matching-based video prediction models, without discussing the latter in the section on prior work.
* The method restricts cameras to be static (line 148). This assumption does not hold in general for casual videos outside of the training data being used and limits the applicability of the method.
* The assumption of causality between actions $a_t$ and observed frames $x_t$ is again specific to robot manipulation
- It shows that combining camera motion dynamics with visual dynamics helps both video prediction and action prediction. This idea is straightforward but is useful in many cases, such as the design of world models for embodied agents.
- The paper is easy to follow, and the supplementary materials seem comprehensive.
Novelty - The main weakness of this paper is a lack of novelty. Indeed, using camera motion dynamics to condition video generation is not new. Recent studies have even tried to customize video generation with user-directed camera movement (Direct-a-Video, SIGGRAPH'24), or abstract textual motion descriptions (LEGO, ECCV'24). In this case, multimodal training of actions and images on a base model such as SVG-LP may not meet the expectations of the audience in recent research communities
Taxonomy
Topics: Human Motion and Animation · Video Analysis and Summarization · Human Pose and Action Recognition
Methods: Diffusion · Variational Inference
