Steering and Rectifying Latent Representation Manifolds in Frozen Multi-modal LLMs for Video Anomaly Detection

Zhaolin Cai; Fan Li; Huiyu Duan; Lijun He; Guangtao Zhai

arXiv:2602.24021·cs.CV·March 2, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces SteerVAD, a novel framework that actively steers and rectifies internal representations of frozen multi-modal large language models to improve video anomaly detection, achieving state-of-the-art results with minimal training data.

Contribution

The paper proposes a new intervention framework, SteerVAD, that enhances MLLM-based VAD by identifying key attention heads and applying targeted representation scaling to improve anomaly detection.
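To make "targeted representation scaling" concrete, here is a minimal sketch in plain Python. It is not the paper's module: this summary does not give the form of the rectification, and the reviewers describe it as dynamic and context-aware. The fixed per-head gains, the `(layer, head)` addressing, and the name `scale_heads` are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical addressing scheme: (layer index, head index).
HeadId = Tuple[int, int]


def scale_heads(
    head_outputs: Dict[HeadId, List[float]],
    gains: Dict[HeadId, float],
) -> Dict[HeadId, List[float]]:
    """Return a copy of head_outputs where each selected head is scaled.

    Heads that do not appear in `gains` keep a gain of 1.0, so the frozen
    model's behaviour is unchanged outside the selected heads.
    Assumption: a fixed scalar gain per head stands in for whatever
    context-dependent factor the actual method computes.
    """
    steered: Dict[HeadId, List[float]] = {}
    for head, vector in head_outputs.items():
        gain = gains.get(head, 1.0)
        steered[head] = [gain * x for x in vector]
    return steered


if __name__ == "__main__":
    outputs = {(0, 0): [1.0, 2.0], (0, 1): [3.0, 4.0]}
    # Dampen one head, leave the other untouched.
    print(scale_heads(outputs, {(0, 1): 0.5}))
```

The point of the sketch is that the intervention touches only a few head outputs at inference time and leaves the backbone weights untouched, which is what keeps the approach tuning-free.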

Findings

1. Achieves state-of-the-art performance among tuning-free methods.
2. Requires only 1% of training data for effective detection.
3. Demonstrates robustness across mainstream benchmarks.

Abstract

Video anomaly detection (VAD) aims to identify abnormal events in videos. Traditional VAD methods generally suffer from the high costs of labeled data and full training, thus some recent works have explored leveraging frozen multi-modal large language models (MLLMs) in a tuning-free manner to perform VAD. However, their performance is limited as they directly inherit pre-training biases and cannot adapt internal representations to specific video contexts, leading to difficulties in handling subtle or ambiguous anomalies. To address these limitations, we propose a novel intervention framework, termed SteerVAD, which advances MLLM-based VAD by shifting from passively reading to actively steering and rectifying internal representations. Our approach first leverages the gradient-free representational separability analysis (RSA) to identify top attention heads as latent anomaly experts…
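As a rough illustration of gradient-free head selection, the sketch below ranks attention heads by a Fisher-style separability score. The paper's actual RSA metric (its Eq. 1) is not reproduced in this summary; Reviewer 01 describes it only as a linear metric based on centroids and variance. The ratio used here, the `eps` term, and every function name are therefore assumptions, not the authors' definition.

```python
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]
HeadId = Tuple[int, int]  # hypothetical (layer index, head index)


def centroid(vectors: Sequence[Vector]) -> List[float]:
    """Component-wise mean of a non-empty set of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def scatter(vectors: Sequence[Vector], center: Vector) -> float:
    """Mean squared distance of the vectors to their centroid."""
    return sum(sq_dist(v, center) for v in vectors) / len(vectors)


def separability(normal: Sequence[Vector], abnormal: Sequence[Vector],
                 eps: float = 1e-8) -> float:
    """Assumed score: between-class centroid distance over within-class scatter."""
    mu_n, mu_a = centroid(normal), centroid(abnormal)
    between = sq_dist(mu_n, mu_a)
    within = scatter(normal, mu_n) + scatter(abnormal, mu_a)
    return between / (within + eps)


def top_heads(
    head_features: Dict[HeadId, Tuple[Sequence[Vector], Sequence[Vector]]],
    k: int,
) -> List[HeadId]:
    """Rank heads by separability on (normal, abnormal) features; keep the top k."""
    scores = {h: separability(n, a) for h, (n, a) in head_features.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


if __name__ == "__main__":
    features = {
        (0, 0): ([[0.0, 0.0], [0.0, 2.0]], [[10.0, 0.0], [10.0, 2.0]]),
        (0, 1): ([[0.0, 0.0], [0.0, 2.0]], [[1.0, 0.0], [1.0, 2.0]]),
    }
    print(top_heads(features, k=1))
```

No gradients or backbone updates are involved: the score needs only forward-pass features from a small labeled calibration set. Because it depends solely on centroids and scatter, it is also exposed to the limitation Reviewer 01 raises about linear metrics on non-convex manifolds.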

Peer Reviews

Decision: ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

1. The paper proposes a novel paradigm, shifting from "passive feature reading" to "active geometric intervention." It elegantly addresses pre-training bias by identifying LAEs via RSA and applying dynamic, context-aware rectification using the HMC. This core hypothesis is strongly supported by t-SNE visualizations (Figure 4), which show the feature space transforming from "entangled" to "linearly separable," and by attention heatmaps (Figures 5, 9, and 10), which confirm this geometric reshaping leads

Weaknesses

1. There is a conceptual contradiction in the paper's theory. The theoretical foundation (Sec. 3.1, Appendix A) emphasizes the need to model complex, non-convex, and entangled manifolds, which is confirmed by the visualizations in Figure 2. However, the RSA metric (Eq. 1) used to discover these structures is a simple linear metric based on centroids and variance, which implicitly assumes the manifolds are Gaussian-like and convex. This use of a simple linear metric to solve a complex non-linear probl

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1. The paper introduces a fresh and well-motivated idea, geometric steering and rectification of latent representation manifolds in frozen multi-modal LLMs, offering a new paradigm for tuning-free adaptation.
2. The proposed RSA and HMC modules are conceptually clear, lightweight, and technically sound. They enable dynamic, context-aware control over latent representations without modifying the backbone parameters.
3. SteerVAD achieves state-of-the-art performance among all tuning-free methods

Weaknesses

While the paper presents an innovative and well-designed framework, several conceptual and methodological issues remain unclear.
1. The assumption that video datasets form closed and bounded manifolds is theoretically strong and may not hold in practice. Real-world video data are inherently open-set and dynamic, so the claimed manifold topology should be regarded as an approximation rather than a strict property.
2. The paper assumes an invariant conditional distribution $P(Y \mid V, Z)$ between train

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1. The paper is well-structured and clearly articulated.
2. The tuning-free approach is novel: it leverages pre-trained representations through selection and scaling.
3. Ablation studies on RSA and HMC effectively validate the design choices.

Weaknesses

1. Limited Ablation on Calibration Set: The rectified representations heavily depend on HMC tuning via the calibration set. However, the paper lacks critical ablation studies on:
   1) Diversity: Performance impact of varying calibration set compositions (e.g., random subsets, different sizes);
   2) Class Balance: Sensitivity to ratios of normal/abnormal videos in the calibration set.
2. Fair Comparison Issues: Table 1 should include parameter counts for specific backbones (especially Multimodal VAD

Taxonomy

Topics: Anomaly Detection Techniques and Applications · Human Pose and Action Recognition · Video Analysis and Summarization