Multi-turn Physics-informed Vision-language Model for Physics-grounded Anomaly Detection

Yao Gu; Xiaohao Xu; and Yingna Wu

arXiv:2603.15237·cs.CV·March 17, 2026

Multi-turn Physics-informed Vision-language Model for Physics-grounded Anomaly Detection

Yao Gu, Xiaohao Xu, and Yingna Wu

PDF

Open Access

TL;DR

This paper presents a physics-informed instruction tuning framework for vision-language models, significantly improving their ability to detect physics-grounded anomalies in videos by incorporating dynamic constraints and causal reasoning.

Contribution

The authors introduce a novel physics-informed instruction tuning method that encodes physical priors into structured prompts, enhancing anomaly detection and causal explanation capabilities of vision-language models.

Findings

01

Achieves 96.7% AUROC on Phys-AD benchmark, outperforming previous SOTA (66.9%)

02

Enables robust causal reasoning and explanations of dynamic anomalies

03

Demonstrates the effectiveness of structured physics priors in vision-language models

Abstract

Vision-Language Models (VLMs) demonstrate strong general-purpose reasoning but remain limited in physics-grounded anomaly detection, where causal understanding of dynamics is essential. Existing VLMs, trained predominantly on appearance-centric correlations, fail to capture kinematic constraints, leading to poor performance on anomalies such as irregular rotations or violated mechanical motions. We introduce a physics-informed instruction tuning framework that explicitly encodes object properties, motion paradigms, and dynamic constraints into structured prompts. By delivering these physical priors through multi-turn dialogues, our method decomposes causal reasoning into incremental steps, enabling robust internal representations of normal and abnormal dynamics. Evaluated on the Phys-AD benchmark, our approach achieves 96.7% AUROC in video-level detection--substantially outperforming…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsAnomaly Detection Techniques and Applications · Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI)