From Priors to Perception: Grounding Video-LLMs in Physical Reality
Zicheng Zhao, Chaofan Gan, Shijie Li, Weiyao Lin

TL;DR
This paper attributes the physical-reasoning failures of Video-LLMs to Semantic Prior Dominance and introduces a new adversarial dataset (PACC) and a reasoning method (VARC) to improve their grounding in physical laws.
Contribution
It proposes the Unified Attribution Theory, the Programmatic Adversarial Curriculum (PACC), and the Visual-Anchored Reasoning Chain (VARC) to enhance physical reasoning in Video-LLMs without architectural changes.
Findings
Standard fine-tuning on PACC improves physical reasoning in state-of-the-art Video-LLMs.
The approach effectively decouples visual artifacts from logical errors.
Models show significant gains in understanding physical laws after intervention.
Abstract
While Video Large Language Models (Video-LLMs) excel in general understanding, they exhibit systematic deficits in fine-grained physical reasoning. Existing interventions not only suffer from limited generalization but fundamentally conflate generative artifacts with genuine physical fallacies. Furthermore, we find that models fail systematically not only in anti-physics anomalies but also in counter-intuitive scenarios where visual facts contradict statistical expectations. Accordingly, we propose the Unified Attribution Theory: this dual failure stems not from perception deficiency, but from Semantic Prior Dominance -- the reasoning mechanism is deeply hijacked by internal narrative scripts. To address this, we construct the Programmatic Adversarial Curriculum (PACC), the first high-fidelity adversarial video dataset synthesized based on physical laws, thoroughly decoupling visual…
