Pixels to Principles: Probing Intuitive Physics Understanding in Multimodal Language Models
Mohamad Ballout, Serwan Jassim, Elia Bruni

TL;DR
This paper systematically evaluates multimodal large language models on intuitive physics tasks and finds that their reasoning failures stem in part from vision-language misalignment, not from weak visual understanding alone.
Contribution
The paper introduces a probing methodology that measures how well physics-related information is preserved across a model's processing stages, and it identifies vision-language misalignment as a key limitation.
Findings
Models fail to reliably distinguish plausible from implausible physics scenarios.
Vision encoders capture physical cues, but this information is not effectively used by language models.
Improving vision-language alignment is crucial for better intuitive physics reasoning.
Abstract
This paper presents a systematic evaluation of state-of-the-art multimodal large language models (MLLMs) on intuitive physics tasks using the GRASP and IntPhys 2 datasets. We assess the open-source models InternVL 2.5, Qwen 2.5 VL, LLaVA-OneVision, and the proprietary Gemini 2.0 Flash Thinking, finding that even the latest models struggle to reliably distinguish physically plausible from implausible scenarios. To go beyond performance metrics, we conduct a probing analysis of model embeddings, extracting intermediate representations at key processing stages to examine how well task-relevant information is preserved. Our results show that, depending on task difficulty, a critical vision-language misalignment can emerge: vision encoders successfully capture physical plausibility cues, but this information is not effectively utilized by the language model, leading to failures in reasoning.…
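The probing analysis described in the abstract can be illustrated with a minimal sketch: train a simple linear classifier on embeddings taken from each processing stage and compare held-out accuracy across stages. Everything below is an assumption made for illustration, not the authors' implementation: the stage names, the synthetic embeddings, the per-stage signal strengths, and the helper names (`make_synthetic_embeddings`, `train_linear_probe`, `probe_stage`) are hypothetical, and the numbers it produces are not results from the paper.

```python
import numpy as np


def make_synthetic_embeddings(n, dim, signal, rng):
    """Hypothetical stand-in for embeddings extracted at one model stage.

    `signal` controls how strongly the plausibility label is encoded;
    in a real study the features would come from the model itself.
    """
    labels = rng.integers(0, 2, size=n)  # 1 = plausible, 0 = implausible
    direction = rng.normal(size=dim)
    direction /= np.linalg.norm(direction)
    noise = rng.normal(size=(n, dim))
    feats = noise + signal * (2 * labels[:, None] - 1) * direction
    return feats, labels


def train_linear_probe(x, y, lr=0.1, epochs=500, l2=1e-3):
    """Fit a logistic-regression probe with plain gradient descent."""
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))
        grad = p - y
        w -= lr * (x.T @ grad / len(y) + l2 * w)
        b -= lr * grad.mean()
    return w, b


def probe_stage(feats, labels, rng, train_frac=0.7):
    """Return held-out probe accuracy for one stage's embeddings."""
    idx = rng.permutation(len(labels))
    cut = int(train_frac * len(labels))
    train, test = idx[:cut], idx[cut:]
    # Standardize using training statistics only.
    mu = feats[train].mean(axis=0)
    sd = feats[train].std(axis=0) + 1e-8
    x_train = (feats[train] - mu) / sd
    x_test = (feats[test] - mu) / sd
    w, b = train_linear_probe(x_train, labels[train])
    preds = (x_test @ w + b > 0).astype(int)
    return float((preds == labels[test]).mean())


rng = np.random.default_rng(0)
# Assumed signal strengths, chosen only to mimic information being lost
# between the vision encoder and the language model.
stages = {"vision_encoder": 2.0, "projector": 1.0, "language_model": 0.0}
results = {}
for name, signal in stages.items():
    feats, labels = make_synthetic_embeddings(600, 32, signal, rng)
    results[name] = probe_stage(feats, labels, rng)

for name, acc in results.items():
    print(f"{name}: held-out probe accuracy = {acc:.2f}")
```

If probe accuracy is high at the vision stage but falls toward chance at later stages, that pattern is consistent with the kind of misalignment the abstract describes. A linear probe is used here because it tests whether the information is readily decodable, rather than whether a powerful classifier can recover it.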
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems
