Aligning Effective Tokens with Video Anomaly in Large Language Models

Yingxian Chen; Jiahui Liu; Ruidi Fan; Yanwei Li; Chirui Chang; Shizhen Zhao; Wilton W.T. Fok; Xiaojuan Qi; Yik-Chung Wu

arXiv:2508.06350 · cs.CV · November 4, 2025

Open Access

TL;DR

This paper introduces VA-GPT, a multi-modal large language model designed to improve video anomaly detection by effectively aligning visual and language tokens, utilizing novel modules for spatial and temporal analysis, and establishing a new benchmark.

Contribution

The paper proposes VA-GPT with Spatial Effective Token Selection and Temporal Effective Token Generation modules, and creates a new dataset and benchmark for video anomaly detection.

Findings

01. VA-GPT outperforms existing methods on benchmarks.
02. Effective token alignment improves anomaly localization.
03. New dataset enhances fine-tuning of video-anomaly models.

Abstract

Understanding abnormal events in videos is a vital and challenging task that has garnered significant attention in a wide range of applications. Although current video understanding Multi-modal Large Language Models (MLLMs) are capable of analyzing general videos, they often struggle to handle anomalies due to the spatial and temporal sparsity of abnormal events, where the redundant information always leads to suboptimal outcomes. To address these challenges, exploiting the representation and generalization capabilities of Vision Language Models (VLMs) and Large Language Models (LLMs), we propose VA-GPT, a novel MLLM designed for summarizing and localizing abnormal events in various videos. Our approach efficiently aligns effective tokens between visual encoders and LLMs through two key proposed modules: Spatial Effective Token Selection (SETS) and Temporal Effective Token Generation…

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications