Can vision language models learn intuitive physics from interaction?

Luca M. Schulze Buschoff; Konstantinos Voudouris; Can Demircan; Eric Schulz

arXiv:2602.06033·cs.LG·February 6, 2026

Open Access · 3 Reviews

TL;DR

This paper investigates whether vision-language models can develop intuitive physics through interaction, finding that current training methods, including reinforcement learning, do not produce models with generalizable physical understanding.

Contribution

It demonstrates that interaction-based training alone is insufficient for models to acquire robust, generalizable physical intuitions, highlighting a gap in current learning approaches.

Findings

01

Models improve within-task performance with interaction

02

Models fail to generalize physical knowledge to new tasks

03

Interaction training does not produce robust physical intuitions

Abstract

Pre-trained vision language models do not have good intuitions about the physical world. Recent work has shown that supervised fine-tuning can improve model performance on simple physical tasks. However, fine-tuned models do not appear to learn robust physical rules that can generalize to new contexts. Based on research in cognitive science, we hypothesize that models need to interact with an environment to properly learn its physical dynamics. We train models that learn through interaction with the environment using reinforcement learning. While learning from interaction allows models to improve their within-task performance, it fails to produce models with generalizable physical intuitions. We find that models trained on one task do not reliably generalize to related tasks, even if the tasks share visual statistics and physical principles, and regardless of whether the models are…
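The contrast the abstract draws between within-task performance and generalization to related tasks can be pictured as a train-task by test-task score matrix. The following is a minimal sketch under stated assumptions, not the authors' evaluation code: the function name `generalization_gap`, the task names, and all numbers are hypothetical placeholders and are not results from the paper.

```python
from statistics import mean


def generalization_gap(scores):
    """Summarize a train-task x test-task score matrix.

    `scores[train_task][test_task]` is assumed to hold an accuracy in [0, 1].
    Returns (mean within-task score, mean cross-task score, their difference).
    """
    # Diagonal entries: the model is tested on the task it was trained on.
    within = [scores[task][task] for task in scores]
    # Off-diagonal entries: the model is tested on a different task.
    cross = [
        scores[train][test]
        for train in scores
        for test in scores[train]
        if train != test
    ]
    return mean(within), mean(cross), mean(within) - mean(cross)


# Placeholder numbers for illustration only; NOT results from the paper.
toy_scores = {
    "task_a": {"task_a": 0.9, "task_b": 0.5},
    "task_b": {"task_a": 0.4, "task_b": 0.8},
}

within, cross, gap = generalization_gap(toy_scores)
print(f"within={within:.2f} cross={cross:.2f} gap={gap:.2f}")
```

A large positive gap in such a matrix would correspond to the pattern the abstract describes: improvement on the trained task without reliable transfer to related tasks.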

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 3

Strengths

The research question is well motivated: comparing RL with passive supervised learning for physical reasoning is an important question for embodied VLAs and robotics. The method section is clearly written and self-contained, with sufficient implementation details. The experimental design is clean and controlled, and the experimental section is well ordered, with each experiment motivating the next.

Weaknesses

While this serves as a useful motivating example, the scope of these experiments (a single model and a relatively narrow custom task) makes it difficult to draw significant conclusions, and it is unclear whether these findings generalize to larger-scale training with multiple tasks. I like the direction of this work, but I believe the findings and experiments are too limited for a full publication at ICLR. The results are hard to parse; a simple table summarizing performance and rewards would c

Reviewer 02 · Rating 2 · Confidence 4

Strengths

1. Training models to understand intuitive physics by interacting with the environment is a well-motivated hypothesis, as it is similar to how babies learn.
2. The proposed metrics for testing models seem reasonable, and the authors conduct rigorous evaluations.

Weaknesses

1. The dataset seems a bit simple and contrived. The fact that supervised learning performs as well as reinforcement learning might be because the task is very easy, not because both methods fundamentally work equally well.
2. I strongly disagree with the statement made in the conclusion that “these results cast doubt on whether posttraining methods are sufficient for developing models that reason about the world in a human-like manner”. The models not generalizing to new tasks might jus

Reviewer 03 · Rating 6 · Confidence 4

Strengths

- Clear question and hypotheses grounded in cognitive science
- Controlled comparison of SFT and RL with matched PEFT settings and budgets
- Simple tasks with explicit rewards and prompt templates, plus training logs
- Generalization matrix across all train and test task pairs
- Decodability analysis that separates representation competence from output performance
- Useful visualization of reward landscapes and attention maps
- Negative results are reported transparently

Weaknesses

- Very narrow scope. One model family at one size and one environment
- Interaction is minimal. One step RL with short textual actions, not true multi-step closed loop control
- Fixed camera and block sizes make pixel shortcuts likely, which undermines conclusions about physics learning
- No baselines for multitask SFT, joint training across tasks, or auxiliary representation losses
- Linear probe dataset is small and lacks controls such as image only probes or interventions

Taxonomy

Topics: Multimodal Machine Learning Applications · Language and cultural evolution · Explainable Artificial Intelligence (XAI)