HiProbe-VAD: Video Anomaly Detection via Hidden States Probing in Tuning-Free Multimodal LLMs
Zhaolin Cai, Fan Li, Ziwei Zheng, Yanjun Qin

TL;DR
HiProbe-VAD is a tuning-free framework that probes the hidden states of pre-trained multimodal large language models (MLLMs) for video anomaly detection, outperforming existing training-free methods without any fine-tuning.
Contribution
The paper proposes Dynamic Layer Saliency Probing (DLSP), a mechanism that extracts informative hidden states from MLLMs for anomaly detection, enabling a practical, scalable, tuning-free approach.
Findings
Outperforms existing training-free methods on UCF-Crime and XD-Violence datasets.
Demonstrates strong cross-model generalization without fine-tuning.
Shows that intermediate hidden states are more sensitive to anomalies and more linearly separable than the output layer.
Abstract
Video Anomaly Detection (VAD) aims to identify and locate deviations from normal patterns in video sequences. Traditional methods often struggle with substantial computational demands and a reliance on extensive labeled datasets, thereby restricting their practical applicability. To address these constraints, we propose HiProbe-VAD, a novel framework that leverages pre-trained Multimodal Large Language Models (MLLMs) for VAD without requiring fine-tuning. In this paper, we discover that the intermediate hidden states of MLLMs contain information-rich representations, exhibiting higher sensitivity and linear separability for anomalies compared to the output layer. To capitalize on this, we propose a Dynamic Layer Saliency Probing (DLSP) mechanism that intelligently identifies and extracts the most informative hidden states from the optimal intermediate layer during the MLLM's reasoning.…
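The idea of choosing a layer by how well its hidden states separate normal from anomalous samples can be illustrated with a small sketch. This is not the paper's DLSP procedure, whose exact saliency criterion is not given in this summary. It assumes a Fisher-style ratio of between-class to within-class variance as a stand-in measure of linear separability, and it uses synthetic arrays in place of real MLLM hidden states. The names `fisher_score` and `select_salient_layer` are hypothetical.

```python
import numpy as np


def fisher_score(feats, labels):
    """Between-class over within-class variance, averaged over feature dimensions.

    Assumed stand-in for layer saliency; not taken from the paper.
    """
    normal = feats[labels == 0]
    anomalous = feats[labels == 1]
    between = (normal.mean(axis=0) - anomalous.mean(axis=0)) ** 2
    within = normal.var(axis=0) + anomalous.var(axis=0) + 1e-8
    return float(np.mean(between / within))


def select_salient_layer(hidden_states, labels):
    """Score every layer and return (index of best layer, list of scores).

    hidden_states has shape (num_layers, num_samples, dim).
    """
    scores = [fisher_score(layer, labels) for layer in hidden_states]
    return int(np.argmax(scores)), scores


# Synthetic stand-in for per-layer hidden states: the gap between the two
# classes is largest at layer 3 by construction.
rng = np.random.default_rng(0)
num_samples, dim = 200, 16
labels = np.repeat([0, 1], num_samples // 2)  # 0 = normal, 1 = anomalous
gaps = [0.0, 0.5, 1.0, 3.0, 1.0, 0.2]
hidden_states = np.stack([
    rng.normal(size=(num_samples, dim)) + gap * labels[:, None]
    for gap in gaps
])

best_layer, scores = select_salient_layer(hidden_states, labels)
print("most separable layer:", best_layer)
```

With real models, the per-layer arrays would come from the MLLM's intermediate activations on a small labeled set of clips, and a lightweight linear classifier could then be fit on the selected layer's features.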
