TL;DR
Flashback introduces a zero-shot, real-time video anomaly detection method inspired by human memory, using large language models offline to create a scene memory, enabling instant online anomaly detection without heavy computation.
Contribution
The paper presents Flashback, a zero-shot, real-time video anomaly detection (VAD) framework. It uses a large language model offline to build a scene memory, so online detection is fast and needs no real anomaly data.
Findings
Achieves 87.3 AUC on UCF-Crime dataset.
Attains 75.1 AP on XD-Violence dataset.
Outperforms prior zero-shot VAD methods significantly.
Abstract
Video Anomaly Detection (VAD) automatically identifies anomalous events from video, mitigating the need for human operators in large-scale surveillance deployments. However, two fundamental obstacles hinder real-world adoption: domain dependency and real-time constraints -- requiring near-instantaneous processing of incoming video. To this end, we propose Flashback, a zero-shot and real-time video anomaly detection paradigm. Inspired by the human cognitive mechanism of instantly judging anomalies and reasoning in current scenes based on past experience, Flashback operates in two stages: Recall and Respond. In the offline recall stage, an off-the-shelf LLM builds a pseudo-scene memory of both normal and anomalous captions without any reliance on real anomaly data. In the online respond stage, incoming video segments are embedded and matched against this memory via similarity search. By…
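The online respond stage described above can be pictured as a nearest-neighbour lookup over the caption memory. The sketch below is illustrative only, not the authors' implementation: the abstract is truncated before it states the scoring rule, so the softmax-weighted top-k vote, the `k` and `temperature` parameters, the function names, and the toy 3-dimensional embeddings are all assumptions. In practice the embeddings would come from a shared video-text encoder.

```python
import math

# Hypothetical pseudo-scene memory: (caption, toy embedding, is_anomalous).
# In the paper an LLM generates the captions offline; these values are made up.
MEMORY = [
    ("a person walks along the sidewalk", [1.0, 0.1, 0.0], False),
    ("cars wait at a red light", [0.9, 0.3, 0.1], False),
    ("a person smashes a shop window", [0.1, 1.0, 0.2], True),
    ("two people fight in the street", [0.0, 0.9, 0.4], True),
]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def anomaly_score(segment_emb, memory, k=3, temperature=0.1):
    """Return (score in [0, 1], top-k captions) for one video segment.

    Assumed rule: softmax-weight the k most similar captions and sum the
    weight that falls on anomalous ones.
    """
    ranked = sorted(
        ((cosine(segment_emb, emb), caption, is_anom)
         for caption, emb, is_anom in memory),
        key=lambda item: item[0],
        reverse=True,
    )[:k]
    best = ranked[0][0]  # subtract the max for numerical stability
    weights = [math.exp((sim - best) / temperature) for sim, _, _ in ranked]
    anomalous = sum(w for w, (_, _, is_anom) in zip(weights, ranked) if is_anom)
    return anomalous / sum(weights), [caption for _, caption, _ in ranked]


if __name__ == "__main__":
    for name, emb in [("normal-like", [1.0, 0.2, 0.0]),
                      ("anomaly-like", [0.05, 1.0, 0.3])]:
        score, captions = anomaly_score(emb, MEMORY)
        print(name, round(score, 3), captions[0])
```

Because the retrieved captions are returned alongside the score, the same lookup also yields a textual explanation for each segment. How many captions are reported, and how conflicts among them are resolved, is left open here.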
Peer Reviews
Decision: Submitted to ICLR 2026
1. This paper proposes a novel and practical framework that effectively unifies zero-shot capability, real-time inference, and explainability.
2. The proposed model achieves SOTA zero-shot accuracy, outperforming prior works significantly, with high throughput (up to 43.8 fps).
3. The ablation studies convincingly validate key components such as repulsive prompting and memory scaling.
1. The whole method heavily relies on proprietary models (GPT-4o, PerceptionEncoder) without ablation using open-source alternatives (e.g., CLIP, LLaMA), raising reproducibility concerns.
2. The runtime encoder selection mechanism is complex and poorly motivated; no comparison with simpler uncertainty metrics (e.g., entropy) is provided.
3. Ambiguity in the definition of “explanation”: whether it is the top-1 caption or the full top-K list, and how conflicting captions are handled.
+ The paper is technically sound.
+ The proposed model shows improved performance on both UCF-Crime and XD-Violence.
+ Some interesting visualisations, such as Fig 4.
- The review of existing work tends to be limited. It is unclear what the current challenges in this area are, why existing methods are unable to address them, and how the proposed model handles them. Although some of these insights appear in the last few sentences of each paragraph of the related work, they could be presented more clearly.
- The method section is overall clearly written. It would be better to have a notation section detailing the maths symbols and operations us
1. **Novel and Practical Paradigm**: The major strength is its core idea of redefining VAD as a retrieval task over an offline text memory generated by an LLM. This is not only conceptually elegant but also highly practical as it directly addresses the bottleneck of online inference with VLM/LLM.
2. **Excellent Real-Time Performance**: The paper makes a strong commitment to "real-time" and shows high throughput (e.g., 43.8 fps).
1. Ambiguity in Zero-Shot Definition: The method is claimed to be "strictly domain-agnostic," yet the use of domain-specific context prompts (e.g., "university campus" for ShanghaiTech) during memory construction implies reliance on target-domain knowledge. This conflicts with the standard zero-shot assumption and may limit true plug-and-play applicability in completely unseen environments.
2. Dependence on LLM Knowledge and Coverage: Flashback can only detect anomalies that are pre-generated i
