Video Prediction of Dynamic Physical Simulations With Pixel-Space Spatiotemporal Transformers

Dean L Slack; G Thomas Hudson; Thomas Winterbottom; Noura Al Moubayed

arXiv:2510.20807·cs.CV·October 24, 2025

TL;DR

This paper presents a pure transformer model for autoregressive video prediction of physical simulations. Working directly in pixel space, without complex training strategies or latent feature-learning components, it extends prediction horizons and supports interpretability.

Contribution

It introduces a simple, parameter-efficient transformer approach for pixel-space video prediction that extends the prediction horizon relative to latent-space methods on physical simulation tasks while keeping video quality metrics comparable.

Findings

01. Extends the prediction horizon by up to 50% compared to latent-space models
02. Maintains comparable video quality metrics
03. Enables interpretability and generalization to out-of-distribution parameters

Abstract

Inspired by the performance and scalability of autoregressive large language models (LLMs), transformer-based models have seen recent success in the visual domain. This study investigates a transformer adaptation for video prediction with a simple end-to-end approach, comparing various spatiotemporal self-attention layouts. Focusing on causal modeling of physical simulations over time, a common shortcoming of existing video-generative approaches, we attempt to isolate spatiotemporal reasoning via physical object tracking metrics and unsupervised training on physical simulation datasets. We introduce a simple yet effective pure transformer model for autoregressive video prediction that operates on continuous pixel-space representations. Without the need for complex training strategies or latent feature-learning components, our approach significantly extends the time…
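To make the idea of autoregressive prediction in pixel space concrete, the following is a minimal sketch of a rollout loop, not the authors' implementation. The names `rollout`, `predict_next`, and `linear_extrapolation` are hypothetical, and the linear extrapolation is only a stand-in for a trained transformer. It assumes frames are arrays with values in [0, 1] and that the model sees a fixed-length window of the most recent frames.

```python
import numpy as np


def rollout(predict_next, context, n_future):
    """Autoregressively extend `context` (T, H, W, C) by `n_future` frames.

    `predict_next` is any callable mapping a (T, H, W, C) history to the
    next (H, W, C) frame. Each prediction is appended to the history and
    fed back in, so the loop works purely on pixel values.
    """
    frames = list(context)
    window = len(context)  # assumed fixed context length
    for _ in range(n_future):
        history = np.stack(frames[-window:])
        next_frame = predict_next(history)
        # Keep predictions in the valid pixel range before feeding them back.
        frames.append(np.clip(next_frame, 0.0, 1.0))
    return np.stack(frames[window:])


def linear_extrapolation(history):
    """Hypothetical stand-in for a trained model: extrapolate the last two frames."""
    return 2.0 * history[-1] - history[-2]


# Three 4x4 single-channel context frames that brighten uniformly over time.
context = np.stack([np.full((4, 4, 1), v) for v in (0.1, 0.2, 0.3)])
future = rollout(linear_extrapolation, context, n_future=3)
print(future.shape)
```

Because every predicted frame becomes part of the next input, errors can compound over the rollout, which is why the length of the horizon over which predictions stay accurate is a natural quantity to measure.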
