Chain of Time: In-Context Physical Simulation with Image Generation Models

YingQiao Wang; Eric Bigelow; Boyi Li; Tomer Ullman

arXiv:2511.00110·cs.CV·November 4, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces 'Chain of Time', a method that enhances physical reasoning in vision-language models by generating intermediate images during simulation, improving understanding of physical properties without additional training.

Contribution

It presents a novel, cognitively-inspired inference technique that improves physical simulation in image generation models and provides new insights into their reasoning capabilities.

Findings

01 Significantly improves physical simulation performance in image models

02 Reveals models' ability to simulate physical properties over time

03 Identifies limitations in models' inference of physical parameters

Abstract

We propose a novel cognitively-inspired method to improve and interpret physical simulation in vision-language models. Our "Chain of Time" method involves generating a series of intermediate images during a simulation, and it is motivated by in-context reasoning in machine learning, as well as mental simulation in humans. Chain of Time is used at inference time, and requires no additional fine-tuning. We apply the Chain-of-Time method to synthetic and real-world domains, including 2-D graphics simulations and natural 3-D videos. These domains test a variety of particular physical properties, including velocity, acceleration, fluid dynamics, and conservation of momentum. We found that using Chain-of-Time simulation substantially improves the performance of a state-of-the-art image generation model. Beyond examining performance, we also analyzed the specific states of the world simulated…
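The idea of generating intermediate images, rather than jumping straight to the final state, can be sketched as a simple rollout loop. This is a minimal illustration and not the authors' implementation: `chain_of_time`, `direct_prediction`, and `toy_step` are hypothetical names, and the toy "model" below stands in for an image generation model by tracking a single position value instead of producing images.

```python
from typing import Callable, List

# A "frame" here is just a float position; in the paper's setting it would be
# an image. This substitution is an assumption made to keep the sketch runnable.
StepFn = Callable[[List[float], float], float]


def chain_of_time(initial_frame: float, step_fn: StepFn,
                  total_time: float, num_steps: int) -> List[float]:
    """Roll forward in small time steps, feeding all earlier frames back in."""
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    dt = total_time / num_steps
    frames = [initial_frame]
    for _ in range(num_steps):
        # Each call sees the full history so far (the in-context "chain").
        frames.append(step_fn(frames, dt))
    return frames


def direct_prediction(initial_frame: float, step_fn: StepFn,
                      total_time: float) -> float:
    """Baseline: ask for the final frame in one jump, with no intermediates."""
    return step_fn([initial_frame], total_time)


def toy_step(history: List[float], dt: float) -> float:
    """Hypothetical stand-in for the model: constant velocity of 2 units/s."""
    velocity = 2.0
    return history[-1] + velocity * dt


frames = chain_of_time(0.0, toy_step, total_time=1.0, num_steps=4)
final_direct = direct_prediction(0.0, toy_step, total_time=1.0)
print(frames)        # intermediate states plus the final one
print(final_direct)  # single-jump prediction
```

With this toy stand-in both routes reach the same final state, since constant-velocity motion is trivial to extrapolate. The sketch only shows the control flow: intermediate frames are produced one at a time and kept in context, which is also what makes the simulated trajectory available for inspection.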

Peer Reviews

Decision · Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 2

Strengths

- Studies a unique topic: physics understanding in image generation models
- Motivates the study from an interdisciplinary perspective

Weaknesses

- Limited evaluation. Image generation models span a variety of designs, and evaluating only one is insufficient.
- Given that GPT's image generation is a closed model with little public detail, it may be difficult to act on these findings to improve image generation models.
- Limited context from related work.
- Experimental results are unclear. For example, the paper mentions that Figure 6 shows that IGM is able to simulate the projectile's motion because it is close to ground truth, but the pat

Reviewer 02 · Rating 2 · Confidence 4

Strengths

The paper is interesting in its approach and the general context of the problem is important. It's a well-written paper and is easy to follow. I also enjoyed the clarity of the method description and the ample detail given. I thought using computer vision algorithms to extract the state for better analysis was a nice idea, but see below.

Weaknesses

I think the main issue of the paper is its scope and especially the experimental setup. I understand why using such simple physical systems was necessary if exact state estimates are needed, but this is a major hindrance for the paper. The experiments only cover a very simple set of physical systems under ideal observation conditions - I feel that to conclude anything about models' abilities to reason about physics, much more detailed and elaborate systems should be examined and analyzed. At t

Reviewer 03 · Rating 4 · Confidence 4

Strengths

- Shows strong results on 2D motion and gravity scenes
- There is partial success in more complex simulations, like fluids and a bouncing ball
- The method works at inference time and works with existing models like GPT-4o

Weaknesses

- The mechanism is implicit (de-render, transition based on world transition matrix, rendering) and difficult to test
- In 3D scenes, the early error seems to compound, making it difficult to simulate longer time-steps
- Not much comparison with other existing methods (video or world models) for generating physically plausible images
- The generalization seems limited to very simple scenes and breaks when applied to more complex physics problems (fluid, bouncing)

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Model Reduction and Neural Networks