Video Occupancy Models
Manan Tomar, Philippe Hansen-Estruch, Philip Bachman, Alex Lamb, John Langford, Matthew E. Taylor, Sergey Levine

TL;DR
Video Occupancy models (VOCs) are a new class of video prediction models that operate in a compact latent space and directly predict the discounted distribution of future states, which makes them better suited to downstream control tasks.
Contribution
VOCs predict the discounted distribution of future states in a single step within a latent space. This avoids the multistep roll-outs required by prior latent-space world models and yields predictive models of video that are better suited to control applications.
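To make the single-step idea concrete, here is a minimal NumPy sketch that uses a toy tabular Markov chain as a stand-in for latent dynamics. This is an assumption for illustration only: VOCs learn a model over latent representations of video, not a known transition matrix. The sketch computes the discounted future-state distribution, using the convention (1 - γ) Σ_{k≥1} γ^(k-1) P^k, with one linear solve, and checks it against a truncated multistep roll-out. The function names, the matrix `P`, and `gamma = 0.9` are hypothetical.

```python
import numpy as np


def occupancy_single_step(P: np.ndarray, gamma: float) -> np.ndarray:
    """Discounted future-state distribution computed in one linear solve.

    Row s equals (1 - gamma) * sum_{k>=1} gamma^(k-1) * P^k[s],
    which has the closed form (1 - gamma) * (I - gamma P)^{-1} P.
    """
    n = P.shape[0]
    return (1 - gamma) * np.linalg.solve(np.eye(n) - gamma * P, P)


def occupancy_rollout(P: np.ndarray, gamma: float, horizon: int) -> np.ndarray:
    """The same quantity approximated by stepping the dynamics forward."""
    total = np.zeros_like(P)
    Pk = P.copy()  # holds P^k, starting at k = 1
    for k in range(1, horizon + 1):
        total += gamma ** (k - 1) * Pk
        Pk = Pk @ P  # advance one more step: P^(k+1)
    return (1 - gamma) * total


# Hypothetical 3-state chain (rows are next-state distributions).
P = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.8, 0.2],
              [0.3, 0.0, 0.7]])
gamma = 0.9

M = occupancy_single_step(P, gamma)
print(np.allclose(M, occupancy_rollout(P, gamma, horizon=500)))  # True
print(np.allclose(M.sum(axis=1), 1.0))  # True: each row is a distribution
```

The closed form is available only because `P` is small and known here. Loosely, a learned occupancy model plays the role of the one-shot solve: it maps the current latent state to the whole discounted future distribution without iterating the dynamics.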
Findings
VOCs operate efficiently in a compact latent space.
VOCs outperform prior models in predictive accuracy.
Code is publicly available for reproducibility.
Abstract
We introduce a new family of video prediction models designed to support downstream control tasks. We call these models Video Occupancy models (VOCs). VOCs operate in a compact latent space, thus avoiding the need to make predictions about individual pixels. Unlike prior latent-space world models, VOCs directly predict the discounted distribution of future states in a single step, thus avoiding the need for multistep roll-outs. We show that both properties are beneficial when building predictive models of video for use in downstream control. Code is available at https://github.com/manantomar/video-occupancy-models.
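A predictor of the discounted future-state distribution can, in principle, be trained without model roll-outs. One way is to pair each state with a future state drawn at a random offset k ~ Geometric(1 - γ). The distribution of such targets matches the closed-form discounted occupancy. The sketch below checks this on a toy chain over three hypothetical discrete latent codes, which stand in for the compact latent space. This Monte Carlo scheme is an assumed illustration, not necessarily how VOCs are trained. In practice the targets would come from recorded video trajectories rather than a simulated chain. The names, `P`, `gamma`, and the sample count are all hypothetical.

```python
import numpy as np


def occupancy_closed_form(P: np.ndarray, gamma: float) -> np.ndarray:
    # (1 - gamma) * sum_{k>=1} gamma^(k-1) P^k  ==  (1 - gamma) (I - gamma P)^{-1} P
    n = P.shape[0]
    return (1 - gamma) * np.linalg.solve(np.eye(n) - gamma * P, P)


def sample_future_state(s0: int, cdf: np.ndarray, gamma: float,
                        rng: np.random.Generator) -> int:
    """One Monte Carlo target: the state k steps ahead, k ~ Geometric(1 - gamma)."""
    k = rng.geometric(1 - gamma)  # k >= 1 with P(k) = (1 - gamma) * gamma^(k-1)
    s = s0
    for _ in range(k):
        # Inverse-CDF sampling of the next state from row s of P.
        s = int(np.searchsorted(cdf[s], rng.random(), side="right"))
    return s


# Hypothetical 3-code latent chain (rows are next-state distributions).
P = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.8, 0.2],
              [0.3, 0.0, 0.7]])
gamma = 0.9
cdf = np.cumsum(P, axis=1)
cdf[:, -1] = 1.0  # guard against float round-off in the last bin

rng = np.random.default_rng(0)
n_samples = 5000
counts = np.zeros(P.shape[0])
for _ in range(n_samples):
    counts[sample_future_state(0, cdf, gamma, rng)] += 1

empirical = counts / n_samples               # frequency of sampled targets
exact = occupancy_closed_form(P, gamma)[0]   # occupancy row for start state 0
print(np.round(empirical, 3), np.round(exact, 3))
```

The two printed rows should agree up to sampling noise. A model fit to these (state, future-state) pairs therefore targets the discounted distribution directly, in one prediction.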
Taxonomy
Topics: Image and Video Quality Assessment
