Flashback: Memory-Driven Zero-shot, Real-time Video Anomaly Detection

Hyogun Lee; Haksub Kim; Ig-Jae Kim; Yonghun Choi

arXiv:2505.15205·cs.CV·May 26, 2025

3 Reviews

TL;DR

Flashback introduces a zero-shot, real-time video anomaly detection method inspired by human memory, using large language models offline to create a scene memory, enabling instant online anomaly detection without heavy computation.

Contribution

The paper presents Flashback, a novel zero-shot, real-time VAD framework that leverages large language models offline to enable fast online anomaly detection without relying on real anomaly data.

Findings

01

Achieves 87.3 AUC on UCF-Crime dataset.

02

Attains 75.1 AP on XD-Violence dataset.

03

Outperforms prior zero-shot VAD methods significantly.

Abstract

Video Anomaly Detection (VAD) automatically identifies anomalous events from video, mitigating the need for human operators in large-scale surveillance deployments. However, two fundamental obstacles hinder real-world adoption: domain dependency and real-time constraints -- requiring near-instantaneous processing of incoming video. To this end, we propose Flashback, a zero-shot and real-time video anomaly detection paradigm. Inspired by the human cognitive mechanism of instantly judging anomalies and reasoning in current scenes based on past experience, Flashback operates in two stages: Recall and Respond. In the offline recall stage, an off-the-shelf LLM builds a pseudo-scene memory of both normal and anomalous captions without any reliance on real anomaly data. In the online respond stage, incoming video segments are embedded and matched against this memory via similarity search. By…
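The online respond stage described above can be illustrated with a minimal sketch. It assumes the memory is a list of captions with precomputed embeddings and normal/anomalous labels, and that a segment is scored by the fraction of anomalous captions among its top-k most similar entries. The toy 3-dimensional vectors, the names `MEMORY`, `cosine`, and `anomaly_score`, and the scoring rule itself are illustrative assumptions, not the paper's implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def anomaly_score(segment_emb, memory, k=3):
    """Score a segment by the share of anomalous captions among its top-k matches.

    This scoring rule is an assumption made for illustration only.
    Returns (score, captions of the top-k entries, most similar first).
    """
    ranked = sorted(memory, key=lambda e: cosine(segment_emb, e["emb"]), reverse=True)
    top = ranked[:k]
    if not top:
        return 0.0, []
    score = sum(1 for e in top if e["anomalous"]) / len(top)
    return score, [e["caption"] for e in top]

# Hypothetical pseudo-scene memory; in the paper the captions come from an LLM
# offline, and the embeddings here are hand-written toy vectors.
MEMORY = [
    {"caption": "people walking on a sidewalk", "anomalous": False, "emb": [1.0, 0.0, 0.1]},
    {"caption": "a car driving along a road", "anomalous": False, "emb": [0.9, 0.1, 0.0]},
    {"caption": "a person smashing a shop window", "anomalous": True, "emb": [0.0, 1.0, 0.1]},
    {"caption": "two people fighting in a street", "anomalous": True, "emb": [0.1, 0.9, 0.0]},
]

# A segment embedding close to the anomalous captions, and one close to the normal ones.
suspicious_score, suspicious_captions = anomaly_score([0.05, 1.0, 0.05], MEMORY, k=2)
normal_score, normal_captions = anomaly_score([1.0, 0.0, 0.0], MEMORY, k=2)
```

In this sketch the retrieved captions double as a human-readable explanation of the score. The online step needs only an embedding and a nearest-neighbour lookup, which matches the abstract's framing of heavy LLM work happening offline.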

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01·Rating 4·Confidence 3

Strengths

1. This paper proposes a novel and practical framework that effectively unifies zero-shot capability, real-time inference, and explainability.
2. The proposed model achieves SOTA zero-shot accuracy, outperforming prior works significantly, with high throughput (up to 43.8 fps).
3. The ablation studies convincingly validate key components such as repulsive prompting and memory scaling.

Weaknesses

1. The method relies heavily on proprietary models (GPT-4o, PerceptionEncoder) without ablations using open-source alternatives (e.g., CLIP, LLaMA), raising reproducibility concerns.
2. The runtime encoder selection mechanism is complex and poorly motivated; no comparison with simpler uncertainty metrics (e.g., entropy) is provided.
3. The definition of "explanation" is ambiguous: it is unclear whether it is the top-1 caption or the full top-K list, and how conflicting captions are handled.

Reviewer 02·Rating 2·Confidence 4

Strengths

+ The paper is technically sound.
+ The proposed model shows improved performance on both UCF-Crime and XD-Violence.
+ Some interesting visualisations, such as Fig 4.

Weaknesses

- The review of existing work tends to be limited. It is unclear what the current challenges in this area are, why existing methods are unable to address them, and how the proposed model handles them. Although some insights are provided in the last few sentences of each paragraph of the related work, they could be presented more clearly.
- The method section is overall clearly written. It would be better to have a notation section detailing the maths symbols and operations us

Reviewer 03·Rating 6·Confidence 3

Strengths

1. **Novel and Practical Paradigm**: The major strength is its core idea of redefining VAD as a retrieval task over an offline text memory generated by an LLM. This is not only conceptually elegant but also highly practical, as it directly addresses the bottleneck of online inference with VLM/LLM.
2. **Excellent Real-Time Performance**: The paper makes a strong commitment to "real-time" and shows high throughput (e.g., 43.8 fps).

Weaknesses

1. Ambiguity in Zero-Shot Definition: The method is claimed to be "strictly domain-agnostic," yet the use of domain-specific context prompts (e.g., "university campus" for ShanghaiTech) during memory construction implies reliance on target-domain knowledge. This conflicts with the standard zero-shot assumption and may limit true plug-and-play applicability in completely unseen environments.
2. Dependence on LLM Knowledge and Coverage: Flashback can only detect anomalies that are pre-generated i
