Human-Centric Anomaly Detection in Surveillance Videos Using YOLO-World and Spatio-Temporal Deep Learning

Mohammad Ali Etemadi Naeen; Hoda Mohammadzade; Saeed Bagheri Shouraki

arXiv:2510.22056·cs.CV·October 28, 2025

TL;DR

This paper presents a human-centric deep learning framework for anomaly detection in surveillance videos, combining YOLO-World, tracking, background suppression, and spatio-temporal modeling to improve accuracy and robustness.

Contribution

The paper introduces a pipeline that integrates open-vocabulary human detection, identity tracking, background suppression, and deep spatio-temporal learning for multi-class anomaly classification.

Findings

01. Achieved 92.41% mean accuracy on a subset of UCF-Crime.

02. Per-class F1-scores exceeded 0.85.

03. Foreground-focused preprocessing improves anomaly detection accuracy.

Abstract

Anomaly detection in surveillance videos remains a challenging task due to the diversity of abnormal events, class imbalance, and scene-dependent visual clutter. To address these issues, we propose a robust deep learning framework that integrates human-centric preprocessing with spatio-temporal modeling for multi-class anomaly classification. Our pipeline begins by applying YOLO-World, an open-vocabulary vision-language detector, to identify human instances in raw video clips, followed by ByteTrack for consistent identity-aware tracking. Background regions outside detected bounding boxes are suppressed via Gaussian blurring, effectively reducing scene-specific distractions and focusing the model on behaviorally relevant foreground content. The refined frames are then processed by an ImageNet-pretrained InceptionV3 network for spatial feature extraction, and temporal dynamics are…
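
The background-suppression step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`gaussian_kernel`, `gaussian_blur`, `suppress_background`), the `(x1, y1, x2, y2)` pixel box format, and the `sigma` value are all assumptions made for illustration, and a small NumPy-only separable blur stands in for whatever blurring routine the paper actually uses. The boxes are assumed to come from an upstream person detector and tracker, which are not shown.

```python
import numpy as np


def gaussian_kernel(sigma: float, radius: int) -> np.ndarray:
    """Return a normalized 1-D Gaussian kernel of length 2 * radius + 1."""
    x = np.arange(-radius, radius + 1, dtype=np.float64)
    k = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()


def gaussian_blur(frame: np.ndarray, sigma: float = 5.0) -> np.ndarray:
    """Blur an H x W x C frame with a separable Gaussian (float64 output)."""
    radius = max(1, int(3 * sigma))
    k = gaussian_kernel(sigma, radius)
    out = frame.astype(np.float64)
    for axis in (0, 1):  # filter rows, then columns
        pad = [(0, 0)] * out.ndim
        pad[axis] = (radius, radius)
        padded = np.pad(out, pad, mode="edge")
        out = np.apply_along_axis(
            lambda v: np.convolve(v, k, mode="valid"), axis, padded
        )
    return out


def suppress_background(frame: np.ndarray, boxes, sigma: float = 5.0) -> np.ndarray:
    """Keep pixels inside the given boxes sharp and blur everything else.

    frame: H x W x C uint8 image.
    boxes: iterable of (x1, y1, x2, y2) pixel coordinates (assumed format).
    """
    h, w = frame.shape[:2]
    mask = np.zeros((h, w), dtype=bool)
    for x1, y1, x2, y2 in boxes:
        x1, y1 = max(0, int(x1)), max(0, int(y1))
        x2, y2 = min(w, int(x2)), min(h, int(y2))
        mask[y1:y2, x1:x2] = True  # foreground region for one person
    blurred = gaussian_blur(frame, sigma)
    out = np.where(mask[..., None], frame.astype(np.float64), blurred)
    return np.clip(np.rint(out), 0, 255).astype(np.uint8)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    demo = rng.integers(0, 256, size=(48, 64, 3), dtype=np.uint8)
    result = suppress_background(demo, [(10, 8, 30, 40)], sigma=2.0)
    print(result.shape, result.dtype)
```

In this sketch the foreground mask is the union of all boxes, so overlapping people are both kept sharp. A hard mask edge is the simplest choice; whether the paper feathers the boundary is not stated in the abstract.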
