Human-Centric Anomaly Detection in Surveillance Videos Using YOLO-World and Spatio-Temporal Deep Learning
Mohammad Ali Etemadi Naeen, Hoda Mohammadzade, Saeed Bagheri Shouraki

TL;DR
This paper presents a human-centric deep learning framework for anomaly detection in surveillance videos, combining YOLO-World, tracking, background suppression, and spatio-temporal modeling to improve accuracy and robustness.
Contribution
It introduces a novel pipeline integrating open-vocabulary human detection, identity tracking, background suppression, and deep spatio-temporal learning for multi-class anomaly classification.
Findings
The framework achieved 92.41% mean accuracy on a subset of UCF-Crime.
Per-class F1-scores exceeded 0.85, demonstrating strong performance.
Foreground-focused preprocessing improves anomaly detection accuracy.
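For readers unfamiliar with the metrics quoted above, the following is a minimal sketch of how per-class F1 and overall accuracy are commonly derived from a multi-class confusion matrix. The matrix values are a made-up toy example, not results from the paper, and the row/column convention (rows are true classes, columns are predictions) is an assumption. This summary does not state how the reported mean accuracy was averaged, so the sketch computes plain overall accuracy only.

```python
import numpy as np


def per_class_f1(confusion):
    """Per-class F1 from a square confusion matrix.

    Assumed convention: confusion[i, j] counts samples of true class i
    that were predicted as class j.
    """
    cm = np.asarray(confusion, dtype=np.float64)
    tp = np.diag(cm)              # correct predictions per class
    fp = cm.sum(axis=0) - tp      # predicted as the class, but wrong
    fn = cm.sum(axis=1) - tp      # belonged to the class, but missed
    denom = 2 * tp + fp + fn
    # F1 = 2TP / (2TP + FP + FN); classes with no samples get 0.
    return np.divide(2 * tp, denom, out=np.zeros_like(tp), where=denom > 0)


def overall_accuracy(confusion):
    """Fraction of all samples that lie on the diagonal."""
    cm = np.asarray(confusion, dtype=np.float64)
    return float(np.trace(cm) / cm.sum())


# Hypothetical 3-class confusion matrix, for illustration only.
toy_cm = [
    [9, 1, 0],
    [1, 8, 1],
    [0, 1, 9],
]

f1_scores = per_class_f1(toy_cm)
acc = overall_accuracy(toy_cm)
print("per-class F1:", f1_scores)
print("overall accuracy:", acc)
```

A claim such as "per-class F1 exceeded 0.85" corresponds to every entry of `f1_scores` being above 0.85 for the real confusion matrix.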
Abstract
Anomaly detection in surveillance videos remains a challenging task due to the diversity of abnormal events, class imbalance, and scene-dependent visual clutter. To address these issues, we propose a robust deep learning framework that integrates human-centric preprocessing with spatio-temporal modeling for multi-class anomaly classification. Our pipeline begins by applying YOLO-World, an open-vocabulary vision-language detector, to identify human instances in raw video clips, followed by ByteTrack for consistent identity-aware tracking. Background regions outside detected bounding boxes are suppressed via Gaussian blurring, effectively reducing scene-specific distractions and focusing the model on behaviorally relevant foreground content. The refined frames are then processed by an ImageNet-pretrained InceptionV3 network for spatial feature extraction, and temporal dynamics are…
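The background-suppression step described in the abstract can be illustrated with a small sketch. It assumes person boxes are already available from a detector and tracker as pixel coordinates `(x1, y1, x2, y2)`; that format, the function names, and the `sigma` value are illustrative assumptions rather than the authors' implementation. The blur is written in plain NumPy so the example has no other dependencies; a real pipeline would more likely use an image library's Gaussian blur.

```python
import numpy as np


def gaussian_kernel_1d(sigma):
    """Normalised 1-D Gaussian kernel truncated at about 3 sigma."""
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1, dtype=np.float64)
    kernel = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return kernel / kernel.sum()


def gaussian_blur(frame, sigma):
    """Separable Gaussian blur of an H x W x C array with edge padding."""
    kernel = gaussian_kernel_1d(sigma)
    radius = len(kernel) // 2
    out = frame.astype(np.float64)
    for axis in (0, 1):  # blur rows, then columns
        pad = [(0, 0)] * out.ndim
        pad[axis] = (radius, radius)
        padded = np.pad(out, pad, mode="edge")
        out = np.apply_along_axis(
            lambda v: np.convolve(v, kernel, mode="valid"), axis, padded
        )
    return out


def suppress_background(frame, boxes, sigma=5.0):
    """Blur everything outside the given boxes; keep box interiors sharp.

    frame: H x W x C uint8 image.
    boxes: iterable of (x1, y1, x2, y2) pixel boxes (assumed format).
    """
    height, width = frame.shape[:2]
    keep = np.zeros((height, width), dtype=bool)
    for x1, y1, x2, y2 in boxes:
        # Clip each box to the frame before marking it as foreground.
        x1, y1 = max(0, int(x1)), max(0, int(y1))
        x2, y2 = min(width, int(x2)), min(height, int(y2))
        keep[y1:y2, x1:x2] = True
    blurred = gaussian_blur(frame, sigma)
    mixed = np.where(keep[..., None], frame.astype(np.float64), blurred)
    return np.clip(np.rint(mixed), 0, 255).astype(np.uint8)


# Synthetic noisy frame with one hypothetical person box.
rng = np.random.default_rng(0)
frame = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
result = suppress_background(frame, [(10, 10, 30, 30)], sigma=3.0)
```

Pixels inside the box are copied through unchanged, while everything else is replaced by its blurred version, which removes fine scene detail but preserves coarse context around each tracked person.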
