Programmatic Video Prediction Using Large Language Models
Hao Tang, Kevin Ellis, Suhas Lohit, Michael J. Jones, Moitreya Chatterjee

TL;DR
ProgGen is a neuro-symbolic approach to video prediction that uses large language models to model dynamics through interpretable states. It outperforms competing techniques in the PhyWorld and Cart Pole environments and enables counter-factual reasoning.
Contribution
This work presents ProgGen, a novel method leveraging LLMs/VLMs to synthesize programs for estimating and predicting video states, enhancing interpretability and performance in video prediction tasks.
Findings
Outperforms competing techniques in PhyWorld and Cart Pole environments
Enables counter-factual reasoning and interpretable video generation
Demonstrates effectiveness and generalizability for video prediction
Abstract
Estimating the world model that describes the dynamics of a real-world process is of immense importance for anticipating and preparing for future outcomes. For applications such as video surveillance, robotics, and autonomous driving, this objective entails synthesizing plausible visual futures, given a few frames of a video to set the visual context. Towards this end, we propose ProgGen, which undertakes the task of video frame prediction by representing the dynamics of the video using a set of neuro-symbolic, human-interpretable states (one per frame), leveraging the inductive biases of Large (Vision) Language Models (LLM/VLM). In particular, ProgGen uses an LLM/VLM to synthesize programs: (i) to estimate the states of the video, given the visual context (i.e. the frames); (ii) to predict the states corresponding to future time steps by estimating the…
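As a rough illustration of steps (i) and (ii), the sketch below hand-writes the kind of small programs the abstract describes: one that estimates a symbolic state from context frames and one that predicts future states from it. Everything here is an assumption made for illustration and not the authors' implementation: the `State` fields, the toy binary-grid frames, the constant-velocity dynamics, and all function names are hypothetical. In ProgGen such programs are synthesized by an LLM/VLM rather than written by hand.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple

Frame = List[List[int]]  # toy binary image; a single 1 marks the object


@dataclass(frozen=True)
class State:
    """Hypothetical symbolic state for one frame: position and velocity."""
    x: int
    y: int
    vx: int
    vy: int


def locate(frame: Frame) -> Tuple[int, int]:
    """Return (x, y) of the first non-zero pixel in the frame."""
    for y, row in enumerate(frame):
        for x, value in enumerate(row):
            if value:
                return x, y
    raise ValueError("no object found in frame")


def estimate_state(prev_frame: Frame, frame: Frame) -> State:
    """Step (i): estimate the latest state from two context frames.

    Velocity is a finite difference of the two observed positions.
    """
    (x0, y0), (x1, y1) = locate(prev_frame), locate(frame)
    return State(x=x1, y=y1, vx=x1 - x0, vy=y1 - y0)


def step(state: State) -> State:
    """Step (ii): advance one time step (assumed constant-velocity dynamics)."""
    return replace(state, x=state.x + state.vx, y=state.y + state.vy)


def rollout(state: State, horizon: int) -> List[State]:
    """Apply the dynamics program repeatedly to get future states."""
    states = []
    for _ in range(horizon):
        state = step(state)
        states.append(state)
    return states


def make_frame(x: int, y: int, size: int = 5) -> Frame:
    """Build a toy frame with the object at (x, y)."""
    frame = [[0] * size for _ in range(size)]
    frame[y][x] = 1
    return frame


context = [make_frame(0, 0), make_frame(1, 1)]
current = estimate_state(context[0], context[1])
future = rollout(current, horizon=2)

# Counter-factual query: what if the object had no horizontal velocity?
counterfactual = rollout(replace(current, vx=0), horizon=2)
```

Because the state is an explicit, human-readable record, a counter-factual question reduces to editing one field and re-running the same dynamics program, which is the kind of interpretability the summary attributes to the method.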
Taxonomy
Topics: Video Analysis and Summarization · Machine Learning in Healthcare · Multimodal Machine Learning Applications
Methods: Sparse Evolutionary Training
