Can Image-To-Video Models Simulate Pedestrian Dynamics?
Aaron Appelle, Jerome P. Lynch

TL;DR
This paper explores whether advanced image-to-video diffusion transformer models can accurately simulate pedestrian movements in crowded scenes by conditioning on keyframes and evaluating their trajectory predictions.
Contribution
It introduces a framework for assessing I2V models' ability to generate realistic pedestrian dynamics conditioned on keyframes from benchmark datasets.
Findings
I2V models can produce plausible pedestrian trajectories.
Quantitative evaluation shows competitive prediction accuracy.
Models demonstrate potential for crowd simulation applications.
Abstract
Recent high-performing image-to-video (I2V) models based on variants of the diffusion transformer (DiT) have displayed remarkable inherent world-modeling capabilities by virtue of training on large-scale video datasets. We investigate whether these models can generate realistic pedestrian movement patterns in crowded public scenes. Our framework conditions I2V models on keyframes extracted from pedestrian trajectory benchmarks, then evaluates their trajectory prediction performance using quantitative measures of pedestrian dynamics.
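The abstract does not name the quantitative measures it uses. As an illustration only, the sketch below assumes two metrics that are common in pedestrian trajectory prediction: average displacement error (ADE) and final displacement error (FDE). The function name `displacement_errors` and the input layout (one list of (x, y) positions per track, sampled at the same timesteps) are hypothetical and not taken from the paper.

```python
import math


def displacement_errors(predicted, ground_truth):
    """Return (ADE, FDE) for a single pedestrian track.

    Both arguments are equal-length lists of (x, y) positions sampled at
    the same timesteps and expressed in the same units (assumed here).
    """
    if not predicted or len(predicted) != len(ground_truth):
        raise ValueError("tracks must be non-empty and of equal length")

    # Euclidean distance between predicted and true position at each step.
    dists = [math.dist(p, g) for p, g in zip(predicted, ground_truth)]

    ade = sum(dists) / len(dists)  # mean error over the whole horizon
    fde = dists[-1]                # error at the final timestep only
    return ade, fde


# Toy example: the prediction walks straight along x, the true path drifts in y.
pred = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
true = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
ade, fde = displacement_errors(pred, true)
```

In a setting like the one described, the predicted track would presumably come from tracking pedestrians in the generated video and projecting them into the benchmark's coordinate frame; that step is outside this sketch.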
Taxonomy
Topics: Evacuation and Crowd Dynamics · Anomaly Detection Techniques and Applications · Gait Recognition and Analysis
