Vision Language Models Cannot Reason About Physical Transformation
Dezhi Luo, Yijiang Li, Maijunxian Wang, Tianwei Zhao, Bingyang Wang, Siheng Wang, Pinyuan Feng, Pooyan Rahmanzadehgervi, Ziqiao Ma, Hokin Deng

TL;DR
This paper evaluates whether current Vision Language Models truly understand physical transformations, finding they largely fail to maintain invariant representations of physical properties across dynamic scenes, despite some superficial improvements.
Contribution
The study introduces ConservationBench, a new benchmark for assessing physical invariance in VLMs, revealing systematic failures in their reasoning about physical transformations.
Findings
Models perform near chance on conservation tasks.
Textual priors favor invariance, but visual content reduces performance.
Standard techniques do not improve models' understanding of physical transformations.
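A minimal sketch of why paired conserving/non-conserving items matter for the findings above, assuming (hypothetically) a binary "same quantity / different quantity" answer format; the `Item` class and all values are illustrative and not taken from the benchmark. A responder that always answers "unchanged" looks perfect on conserving items but lands exactly at chance once the controls are included.

```python
from dataclasses import dataclass


@dataclass
class Item:
    conserving: bool      # True if the quantity really is preserved in the scene
    predicted_same: bool  # True if the model answered "quantity unchanged"


def accuracy(items):
    """Fraction of items where the prediction matches the ground truth."""
    return sum(i.conserving == i.predicted_same for i in items) / len(items)


def split_accuracy(items):
    """Accuracy on conserving items, on non-conserving controls, and overall."""
    cons = [i for i in items if i.conserving]
    ctrl = [i for i in items if not i.conserving]
    return accuracy(cons), accuracy(ctrl), accuracy(items)


# Hypothetical responder with a pure invariance prior: it always says "unchanged".
items = [Item(conserving=c, predicted_same=True) for c in (True, False) * 4]
cons_acc, ctrl_acc, overall = split_accuracy(items)
print(cons_acc, ctrl_acc, overall)
```

Reporting the two splits side by side, rather than only the pooled score, is what exposes a gain on conservation tasks that is paid for by a drop on controls.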
Abstract
Understanding physical transformations is fundamental for reasoning in dynamic environments. While Vision Language Models (VLMs) show promise in embodied applications, whether they genuinely understand physical transformations remains unclear. We introduce ConservationBench evaluating conservation -- whether physical quantities remain invariant under transformations. Spanning four properties with paired conserving/non-conserving scenarios, we generate 23,040 questions across 112 VLMs. Results reveal systematic failure: performance remains near chance with improvements on conservation tasks accompanied by drops on controls. Control experiments show strong textual priors favoring invariance, yet models perform worse with visual content. Neither temporal resolution, prompting, nor curated sampling helps. These findings show that current VLMs fail to maintain transformation-invariant…
Peer Reviews
Decision · Submitted to ICLR 2026
**Originality** I'm not aware of any previous work investigating conservation in vision language models.
**Quality** The benchmark is well constructed.
**Clarity** The paper follows a clear outline, although I feel the writing is a bit bloated and inaccessible in parts.
**Significance** The paper adds to a growing body of work highlighting the shortcomings of vision language models with basic visual processing. I like the split of the benchmark into conserving vs. non-conserving stimuli. Also,
In general, I feel that this paper does not provide a strong enough novel contribution to recommend acceptance. There is by now a large and growing body of evidence that vision language models fail at very basic visual processing. While this paper adds some novel data to this pile of findings, I find the aspect of perception that it investigates too narrow. Also, the authors do not offer concrete ideas on how these problems could be overcome.
- This paper focuses on conservation under transformation with a counterfactual non-conserving item.
- This paper provides the results under different prompt styles, frame counts, and frame-selection methods (uniform / human / SEVILA-style).
- The current evaluation setup provides models with a maximum of only 16 frames. It is questionable whether this is enough even for a human to understand the physical transformation happening in the video. Therefore, claims like “VLMs cannot reason about physical transformation” are overstated if the inputs to the models do not contain enough information to solve the task.
- The human baseline details are missing. How exactly was human performance evaluated?
- The paper does not evaluate state-o
This paper is methodologically sound, and is an incremental contribution to a growing literature exploring the physical reasoning capabilities of VLMs. Particular strengths are:
* The rigorous control conditions used throughout to test alternative explanations for VLM model performance.
* The large number of open-source models used.
* The use of a meaningful human baseline for comparison.
The paper has a number of weaknesses:
1. The hybrid evaluation is an interesting solution to the problem of evaluating complex outputs, but using LLM judges incurs significant overhead for the practitioner. Since the paper currently relies only on open-source models (as far as I can tell) and the benchmark uses multiple choice questions, the authors could simply use the log-probability of the choice label, conditional on the text-image input. This could be normalised across the possible outcomes
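The log-probability scoring that this reviewer suggests might look roughly like the sketch below. It assumes the per-label log-probabilities have already been read from the model for a given text-image input; the function names and the example numbers are hypothetical, and this is not the paper's evaluation code.

```python
import math


def normalise_choice_logprobs(logprobs):
    """Softmax restricted to the answer labels, so the probabilities sum to 1."""
    m = max(logprobs.values())  # subtract the max for numerical stability
    exps = {label: math.exp(lp - m) for label, lp in logprobs.items()}
    z = sum(exps.values())
    return {label: e / z for label, e in exps.items()}


def pick_choice(logprobs):
    """Return the most probable label and the normalised distribution."""
    probs = normalise_choice_logprobs(logprobs)
    best = max(probs, key=probs.get)
    return best, probs


# Hypothetical log-probabilities of each choice label given one text-image input.
example = {"A": -1.2, "B": -0.4, "C": -3.0}
choice, probs = pick_choice(example)
print(choice, probs)
```

Scoring this way would avoid the LLM judge, though it only applies when label log-probabilities are accessible, as they typically are for open-weight models.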
Taxonomy
Topics: Multimodal Machine Learning Applications · Language and cultural evolution · Language, Metaphor, and Cognition
