TL;DR
ESOM is a training-free streaming model for open-world video anomaly detection that supports dynamic anomaly definitions and runs in real time while achieving state-of-the-art results.
Contribution
The paper introduces ESOM, a training-free, efficient streaming model for open-world video anomaly detection (OWVAD), with modules for definition normalization, token merging, streaming memory, and scoring, plus a new benchmark dataset.
Findings
Achieves real-time efficiency on a single GPU.
Outperforms existing methods in anomaly localization and classification.
Generates accurate anomaly descriptions.
Abstract
Open-world video anomaly detection (OWVAD) aims to detect and explain abnormal events under different anomaly definitions, which is important for applications such as intelligent surveillance and live-streaming content moderation. Recent MLLM-based methods have shown promising open-world generalization, but still suffer from three major limitations: inefficiency for practical deployment, lack of streaming processing adaptation, and limited support for dynamic anomaly definitions in both modeling and evaluation. To address these issues, this paper proposes ESOM, an efficient streaming OWVAD model that operates in a training-free manner. ESOM includes a Definition Normalization module to structure user prompts for reducing hallucination, an Inter-frame-matched Intra-frame Token Merging module to compress redundant visual tokens, a Hybrid Streaming Memory module for efficient causal…
