Video Prediction of Dynamic Physical Simulations With Pixel-Space Spatiotemporal Transformers
Dean L Slack, G Thomas Hudson, Thomas Winterbottom, Noura Al Moubayed

TL;DR
This paper presents a pure transformer model for autoregressive video prediction of physical simulations, extending prediction horizons and enabling interpretability without complex training or latent features.
Contribution
It introduces a simple, parameter-efficient transformer approach for pixel-space video prediction that extends the prediction horizon relative to latent-space methods on physical simulation tasks while keeping video quality comparable.
Findings
Extends prediction horizon by up to 50% compared to latent-space models
Maintains comparable video quality metrics
Enables interpretability and generalization to out-of-distribution parameters
Abstract
Inspired by the performance and scalability of autoregressive large language models (LLMs), transformer-based models have seen recent success in the visual domain. This study investigates a transformer adaptation for video prediction with a simple end-to-end approach, comparing various spatiotemporal self-attention layouts. Focusing on causal modeling of physical simulations over time, a common shortcoming of existing video-generative approaches, we attempt to isolate spatiotemporal reasoning via physical object tracking metrics and unsupervised training on physical simulation datasets. We introduce a simple yet effective pure transformer model for autoregressive video prediction that uses continuous pixel-space representations. Without the need for complex training strategies or latent feature-learning components, our approach significantly extends the time…
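As a rough illustration of the two ideas the abstract names, causal spatiotemporal self-attention and autoregressive prediction in pixel space, the following is a minimal sketch and not the authors' implementation. The function names `frame_causal_mask` and `autoregressive_rollout`, the choice of a frame-level causal layout, and the `predict_next` callable standing in for a trained model are all assumptions made for this example.

```python
import numpy as np


def frame_causal_mask(num_frames: int, tokens_per_frame: int) -> np.ndarray:
    """Boolean attention mask over flattened (frame, patch) tokens.

    Assumed layout (hypothetical, one of several possible spatiotemporal
    layouts): a token may attend to every token in its own frame and in
    all earlier frames, but never to tokens in later frames.
    """
    # Frame index of each token once the video is flattened to a sequence.
    frame_index = np.repeat(np.arange(num_frames), tokens_per_frame)
    # Entry [q, k] is True when query token q is allowed to attend to key k.
    return frame_index[:, None] >= frame_index[None, :]


def autoregressive_rollout(predict_next, context: np.ndarray, steps: int) -> np.ndarray:
    """Extend a video by repeatedly feeding predictions back as input.

    `predict_next` is a placeholder for a trained model: it takes an array
    of frames with shape (T, H, W) and returns the next frame (H, W).
    """
    frames = list(context)
    for _ in range(steps):
        # Predicted pixels are appended directly; no latent encoder/decoder.
        frames.append(predict_next(np.stack(frames)))
    return np.stack(frames)


if __name__ == "__main__":
    mask = frame_causal_mask(num_frames=3, tokens_per_frame=2)
    print(mask.astype(int))

    # Toy stand-in model: the next frame is the last frame plus one.
    video = autoregressive_rollout(lambda f: f[-1] + 1, np.zeros((2, 4, 4)), steps=3)
    print(video.shape)
```

In this sketch the prediction horizon is simply the number of rollout steps; errors made in early predicted frames are fed back into later ones, which is why horizon length is a natural quantity to evaluate for autoregressive models.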
