Can vision language models learn intuitive physics from interaction?
Luca M. Schulze Buschoff, Konstantinos Voudouris, Can Demircan, Eric Schulz

TL;DR
This paper investigates whether vision-language models can develop intuitive physics through interaction, finding that current training methods, including reinforcement learning, do not produce models with generalizable physical understanding.
Contribution
It demonstrates that interaction-based training alone is insufficient for models to acquire robust, generalizable physical intuitions, highlighting a gap in current learning approaches.
Findings
Models improve within-task performance with interaction
Models fail to generalize physical knowledge to new tasks
Interaction training does not produce robust physical intuitions
Abstract
Pre-trained vision language models do not have good intuitions about the physical world. Recent work has shown that supervised fine-tuning can improve model performance on simple physical tasks. However, fine-tuned models do not appear to learn robust physical rules that can generalize to new contexts. Based on research in cognitive science, we hypothesize that models need to interact with an environment to properly learn its physical dynamics. We train models that learn through interaction with the environment using reinforcement learning. While learning from interaction allows models to improve their within-task performance, it fails to produce models with generalizable physical intuitions. We find that models trained on one task do not reliably generalize to related tasks, even if the tasks share visual statistics and physical principles, and regardless of whether the models are…
Peer Reviews
Decision·Submitted to ICLR 2026
The research question is well motivated: comparing RL with passive supervised learning for physical reasoning is an important question for embodied VLAs and robotics. The method section is clearly written and self-contained, with sufficient implementation details. The experimental design is clean and controlled, and the experimental section is well ordered, with each experiment motivating the next.
While this serves as a useful motivating example, the scope of these experiments (single model, relatively narrow custom task) makes it difficult to draw significant conclusions, and it's unclear whether these findings generalize to larger-scale training with multiple tasks. I like the direction of this work, but I believe the findings and experiments are too limited for a full publication at ICLR. The results are hard to parse - a simple table summarizing performance and rewards would c
1. Training models to understand intuitive physics by interacting with the environment is a well-motivated hypothesis, as it is similar to how babies learn.
2. The proposed metrics for testing models seem reasonable, and the authors conduct rigorous evaluations.
1. The dataset seems a bit simple and contrived. The fact that supervised learning performs as well as reinforcement learning might be because it’s a really easy task, and not because both methods fundamentally work equally well.
2. I strongly disagree with the statement made in the conclusion that “these results cast doubt on whether posttraining methods are sufficient for developing models that reason about the world in a human-like manner”. The models not generalizing to new tasks, might jus
- Clear question and hypotheses grounded in cognitive science
- Controlled comparison of SFT and RL with matched PEFT settings and budgets
- Simple tasks with explicit rewards and prompt templates, plus training logs
- Generalization matrix across all train and test task pairs
- Decodability analysis that separates representation competence from output performance
- Useful visualization of reward landscapes and attention maps
- Negative results are reported transparently
- Very narrow scope. One model family at one size and one environment
- Interaction is minimal. One-step RL with short textual actions, not true multi-step closed-loop control
- Fixed camera and block sizes make pixel shortcuts likely, which undermines conclusions about physics learning
- No baselines for multitask SFT, joint training across tasks, or auxiliary representation losses
- Linear probe dataset is small and lacks controls such as image-only probes or interventions
Taxonomy
Topics: Multimodal Machine Learning Applications · Language and cultural evolution · Explainable Artificial Intelligence (XAI)
