Towards Scalable Video Anomaly Retrieval: A Synthetic Video-Text Benchmark
Shuyu Yang, Yilun Wang, Yaxiong Wang, Li Zhu, Zhedong Zheng

TL;DR
This paper introduces SVTA, a large-scale synthetic video-text dataset for anomaly retrieval, addressing data scarcity and privacy issues in real-world anomaly detection by generating diverse videos with paired descriptions.
Contribution
The paper presents SVTA, the first synthetic large-scale dataset for cross-modal anomaly retrieval, created using generative models and large language models to overcome data and privacy limitations.
Findings
SVTA contains 41,315 videos with 1.36 million frames.
The dataset covers 68 anomaly categories and 30 normal activities.
Baseline evaluations indicate that SVTA is a challenging benchmark and is useful for developing robust retrieval methods.
Abstract
Video anomaly retrieval aims to localize anomalous events in videos using natural language queries to facilitate public safety. However, existing datasets suffer from severe limitations: (1) data scarcity due to the long-tail nature of real-world anomalies, and (2) privacy constraints that impede large-scale collection. To address both issues, we introduce SVTA (Synthetic Video-Text Anomaly benchmark), the first large-scale dataset for cross-modal anomaly retrieval, leveraging generative models to overcome data availability challenges. Specifically, we collect and generate video descriptions via an off-the-shelf LLM (Large Language Model) covering 68 anomaly categories, e.g., throwing, stealing, and shooting. These descriptions encompass common long-tail events. We adopt these texts to guide the video generative model to produce diverse and high-quality videos.…
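To make the retrieval setting concrete, here is a minimal sketch of how a text-to-video retrieval benchmark of this kind is commonly scored. It assumes Recall@K over cosine similarity, which is a standard choice for cross-modal retrieval but is not confirmed by the text above as the metric the authors use. The function names and the toy embeddings are hypothetical; real embeddings would come from a trained text encoder and video encoder.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def recall_at_k(text_embs, video_embs, k):
    """Fraction of text queries whose paired video (same index) ranks in the top k.

    Assumption: text_embs[i] describes video_embs[i], one caption per video.
    """
    hits = 0
    for i, text in enumerate(text_embs):
        scores = [cosine(text, video) for video in video_embs]
        # Rank video indices from most to least similar to this query.
        ranked = sorted(range(len(video_embs)), key=lambda j: scores[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(text_embs)

# Toy, hand-made embeddings (hypothetical, for illustration only).
text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
video_embs = [[0.9, 0.1, 0.0], [0.1, 0.2, 0.9], [0.0, 0.3, 0.8]]

r1 = recall_at_k(text_embs, video_embs, k=1)
r2 = recall_at_k(text_embs, video_embs, k=2)
print(f"R@1 = {r1:.3f}, R@2 = {r2:.3f}")
```

In this toy example only the first query retrieves its paired video at rank 1, while all three paired videos appear within the top 2, so Recall@1 is 1/3 and Recall@2 is 1.0.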
Taxonomy
Topics: Anomaly Detection Techniques and Applications · Video Analysis and Summarization · Multimodal Machine Learning Applications
