Knowledge-Guided Textual Reasoning for Explainable Video Anomaly Detection via LLMs
Hari Lee

TL;DR
This paper proposes a novel language-based framework for weakly supervised video anomaly detection that uses textual representations and knowledge reasoning to improve interpretability and reliability in surveillance applications.
Contribution
It introduces a text-based approach that converts videos into captions, organizes those captions into structured semantic knowledge, and generates explanations, improving interpretability over methods that rely on visual features.
Findings
Achieves competitive anomaly detection performance on UCF-Crime and XD-Violence datasets.
Provides interpretable, knowledge-grounded explanations for detected anomalies.
Demonstrates the effectiveness of textual reasoning in real-world surveillance scenarios.
Abstract
We introduce Text-based Explainable Video Anomaly Detection (TbVAD), a language-driven framework for weakly supervised video anomaly detection that performs anomaly detection and explanation entirely within the textual domain. Unlike conventional WSVAD models that rely on explicit visual features, TbVAD represents video semantics through language, enabling interpretable and knowledge-grounded reasoning. The framework operates in three stages: (1) transforming video content into fine-grained captions using a vision-language model, (2) constructing structured knowledge by organizing the captions into four semantic slots (action, object, context, environment), and (3) generating slot-wise explanations that reveal which semantic factors contribute most to the anomaly decision. We evaluate TbVAD on two public benchmarks, UCF-Crime and XD-Violence, demonstrating that textual knowledge…
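The slot structure and slot-wise explanation described in stages (2) and (3) can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how captions are assigned to slots or how contributions are computed, so the keyword lexicon (`SLOT_KEYWORDS`), the per-word weights (`ANOMALY_WEIGHTS`), and all function names below are hypothetical stand-ins chosen only to show the shape of the idea. The caption is assumed to come from an upstream vision-language model, which is not modeled here.

```python
from dataclasses import dataclass, field

# The four semantic slots named in the abstract.
SLOTS = ("action", "object", "context", "environment")

# Hypothetical keyword lexicon: the paper does not specify how slots are filled.
SLOT_KEYWORDS = {
    "action": {"running", "fighting", "breaking", "walking"},
    "object": {"car", "window", "knife", "bag"},
    "context": {"crowd", "alone", "argument"},
    "environment": {"street", "store", "night", "parking"},
}

# Hypothetical per-word anomaly weights, for illustration only.
ANOMALY_WEIGHTS = {
    "fighting": 0.9,
    "breaking": 0.8,
    "knife": 0.9,
    "argument": 0.5,
    "night": 0.2,
}


@dataclass
class SlotKnowledge:
    """Caption words grouped by semantic slot."""
    slots: dict = field(default_factory=lambda: {s: [] for s in SLOTS})


def fill_slots(caption: str) -> SlotKnowledge:
    """Assign each known caption word to its slot (toy keyword matching)."""
    knowledge = SlotKnowledge()
    cleaned = caption.lower().replace(",", " ").replace(".", " ")
    for word in cleaned.split():
        for slot in SLOTS:
            if word in SLOT_KEYWORDS[slot]:
                knowledge.slots[slot].append(word)
    return knowledge


def slot_contributions(knowledge: SlotKnowledge) -> dict:
    """Sum the toy anomaly weights of the words in each slot."""
    return {
        slot: sum(ANOMALY_WEIGHTS.get(word, 0.0) for word in words)
        for slot, words in knowledge.slots.items()
    }


def explain(caption: str) -> str:
    """Return a one-line explanation naming the highest-contributing slot."""
    knowledge = fill_slots(caption)
    contributions = slot_contributions(knowledge)
    top_slot = max(contributions, key=contributions.get)
    evidence = ", ".join(knowledge.slots[top_slot]) or "no matched words"
    return f"Top slot: {top_slot} ({evidence})"


if __name__ == "__main__":
    caption = "A man is breaking a car window at night in a parking lot."
    print(slot_contributions(fill_slots(caption)))
    print(explain(caption))
```

In a real system the keyword lookup would presumably be replaced by a language model and the weights by learned scores. The sketch only shows how a per-slot breakdown makes it possible to say which semantic factor drove a decision.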
Taxonomy
Topics: Anomaly Detection Techniques and Applications · Video Analysis and Summarization · Human Pose and Action Recognition
