# Attention-guided spatio-temporal feature fusion for robust video surveillance anomaly detection

**Authors:** S. Deepa Nivethika, Shreyash Joshi, Kshitij Verma, V. Aishwarya, Vimal Varshan Srinivasan, M. Senthil Pandian, Prabhakaran Paulraj

PMC · DOI: 10.1038/s41598-026-36130-z · Scientific Reports · 2026-02-10

## TL;DR

This paper introduces HybridModel-1, an attention-guided framework that detects anomalous activity in surveillance video by fusing object-level spatial detection with temporal motion modeling.

## Contribution

The main contribution is an attention-guided spatio-temporal hybrid framework (HybridModel-1) that combines an Adaptive Feature Fusion Module (AFFM) with a Temporal Confidence Reweighting Loss.
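
This summary does not give the internals of the AFFM, so the following is only a minimal sketch of the general idea of attention-gated fusion, not the authors' module. It assumes two per-frame feature vectors of equal length and a hypothetical learned scoring matrix `gate_weights`. The names `fuse_frame` and `gate_weights` are illustrative and do not come from the paper.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a short list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_frame(spatial, temporal, gate_weights):
    """Blend one frame's spatial and temporal feature vectors.

    gate_weights is a hypothetical learned matrix with two rows
    (one per stream), each scoring the concatenated features.
    """
    joint = spatial + temporal  # list concatenation, length 2 * D
    scores = [sum(w * x for w, x in zip(row, joint)) for row in gate_weights]
    a_spatial, a_temporal = softmax(scores)  # attention weights sum to 1
    fused = [a_spatial * s + a_temporal * t for s, t in zip(spatial, temporal)]
    return fused, (a_spatial, a_temporal)

# Toy inputs: with an all-zero gate both streams receive equal weight.
spatial = [1.0, 0.0, 2.0]
temporal = [0.0, 1.0, 0.0]
gate = [[0.0] * 6, [0.0] * 6]
fused, attn = fuse_frame(spatial, temporal, gate)
```

In a trained model the gate would be learned, so the weights would shift toward whichever stream is more informative for a given frame. The zero gate here only makes the behavior easy to check by hand.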

## Key findings

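- The proposed HybridModel-1 achieves 87.6% accuracy, 95.6% precision, 77.1% recall, and a ROC–AUC of 0.96 on the UCF-Crime, ShanghaiTech, and DCSASS surveillance benchmarks.
- It outperforms the standalone spatial (EfficientNetV2B0–HOG) and temporal (ConvLSTM) baselines in robustness and temporal consistency.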
- Ablation studies confirm the effectiveness of the fusion and temporal consistency mechanisms.
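
The abstract reports precision (95.6%) and recall (77.1%) for HybridModel-1 but no F1-score. The short calculation below derives the F1 implied by those two figures. This value is arithmetic on the reported numbers, not a result stated in the paper, and it assumes both metrics describe the same evaluation set.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision and recall as reported in the abstract.
f1 = f1_score(0.956, 0.771)
print(f"F1 = {f1:.3f}")  # prints F1 = 0.854
```

The gap between precision and recall indicates a conservative detector: few false alarms, at the cost of missing some anomalous events.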

## Abstract

Dynamic object detection and tracking are essential components of intelligent video surveillance systems, enabling real-time monitoring and early identification of anomalous activities. Existing approaches often rely on either spatial appearance modeling or temporal sequence analysis, which limits robustness in crowded and dynamically evolving scenes. This study first evaluates representative spatial and temporal baseline models for theft detection, including an EfficientNetV2B0–HOG framework and a ConvLSTM-based temporal model, which achieve F1-scores of 0.86 and high recall but suffer from limited temporal consistency and sensitivity to data imbalance. To address these limitations, we propose an attention-guided spatio-temporal hybrid framework, referred to as HybridModel-1, which integrates object-level spatial detection with temporal motion modeling. The proposed model incorporates an Adaptive Feature Fusion Module (AFFM) to dynamically emphasize salient spatial features and a Temporal Confidence Reweighting Loss to suppress temporally inconsistent predictions. Evaluated on large-scale surveillance benchmarks including UCF-Crime, ShanghaiTech, and DCSASS, the proposed framework achieves an accuracy of 87.6%, a precision of 95.6%, a recall of 77.1%, and a ROC–AUC of 0.96, outperforming standalone spatial and temporal baselines. Ablation studies further confirm the effectiveness of the proposed fusion and temporal consistency mechanisms, demonstrating the model’s suitability for real-time surveillance applications.
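
The abstract says the Temporal Confidence Reweighting Loss suppresses temporally inconsistent predictions but does not give its formula. The sketch below shows one plausible way such a loss could work, under assumptions that are mine rather than the paper's: each frame's binary cross-entropy is scaled by `1 + lam * |p_t - mean(p_{t-1}, p_{t+1})|`, so frames whose confidence deviates from their neighbours are penalised more. The function name, the `lam` parameter, and the weighting rule are all hypothetical.

```python
import math

def temporal_reweighted_bce(probs, labels, lam=1.0, eps=1e-7):
    """Per-frame binary cross-entropy, up-weighted where a frame's
    predicted probability disagrees with its temporal neighbours.

    Illustrative only: the weighting rule is an assumption, not the
    formulation used in the paper. Expects a non-empty sequence.
    """
    n = len(probs)
    total = 0.0
    for t in range(n):
        p = min(max(probs[t], eps), 1.0 - eps)  # clip to avoid log(0)
        bce = -(labels[t] * math.log(p) + (1 - labels[t]) * math.log(1.0 - p))
        prev_p = probs[max(t - 1, 0)]      # first frame reuses itself
        next_p = probs[min(t + 1, n - 1)]  # last frame reuses itself
        inconsistency = abs(probs[t] - 0.5 * (prev_p + next_p))
        total += (1.0 + lam * inconsistency) * bce
    return total / n

# A steady "normal" sequence versus one with a single-frame spike.
labels = [0, 0, 0, 0, 0]
steady = [0.1, 0.1, 0.1, 0.1, 0.1]
flicker = [0.1, 0.1, 0.9, 0.1, 0.1]

loss_steady = temporal_reweighted_bce(steady, labels, lam=2.0)
loss_flicker_plain = temporal_reweighted_bce(flicker, labels, lam=0.0)
loss_flicker_weighted = temporal_reweighted_bce(flicker, labels, lam=2.0)
```

With `lam=0` the function reduces to ordinary mean binary cross-entropy. Raising `lam` increases the penalty on the isolated spike while leaving the steady sequence unchanged, which is the qualitative behavior the abstract describes.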

## Full-text entities

- **Genes:** TRBV20OR9-2 (T cell receptor beta variable 20/OR9-2 (non-functional)) [NCBI Gene 6962] {aka CDR3, TCRBV20S2, TCRBV2O, TCRBV2S2O}
- **Diseases:** fatigue (MESH:D005221), visual anomaly (MESH:D014786)
- **Chemicals:** YOLO (-)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12957453/full.md

## Figures

20 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12957453/full.md

## References

17 references — full list in the complete paper: https://tomesphere.com/paper/PMC12957453/full.md

---
Source: https://tomesphere.com/paper/PMC12957453