Transformer-Driven Multimodal Fusion for Explainable Suspiciousness Estimation in Visual Surveillance
Kuldeep Singh Yadav, Lalan Kumar

TL;DR
This paper introduces a large-scale dataset and a transformer-based multimodal framework for real-time suspiciousness estimation in visual surveillance, enhancing accuracy and interpretability in threat detection.
Contribution
The paper presents the USE50k dataset and the DeepUSEvision framework, which combines multimodal fusion with transformer networks to improve suspiciousness analysis in complex environments.
Findings
Superior accuracy over state-of-the-art methods
Robustness across diverse surveillance scenarios
Enhanced interpretability of suspiciousness scores
Abstract
Suspiciousness estimation is critical for proactive threat detection and ensuring public safety in complex environments. This work introduces a large-scale annotated dataset, USE50k, along with a computationally efficient vision-based framework for real-time suspiciousness analysis. The USE50k dataset contains 65,500 images captured from diverse and uncontrolled environments, such as airports, railway stations, restaurants, parks, and other public areas, covering a broad spectrum of cues including weapons, fire, crowd density, abnormal facial expressions, and unusual body postures. Building on this dataset, we present DeepUSEvision, a lightweight and modular system integrating three key components: a Suspicious Object Detector based on an enhanced YOLOv12 architecture, dual Deep Convolutional Neural Networks (DCNN-I and DCNN-II) for facial expression and body-language recognition…
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Anomaly Detection Techniques and Applications · Explainable Artificial Intelligence (XAI)
