Aligning Effective Tokens with Video Anomaly in Large Language Models
Yingxian Chen, Jiahui Liu, Ruidi Fan, Yanwei Li, Chirui Chang, Shizhen Zhao, Wilton W.T. Fok, Xiaojuan Qi, Yik-Chung Wu

TL;DR
This paper introduces VA-GPT, a multi-modal large language model designed to improve video anomaly detection by effectively aligning visual and language tokens, utilizing novel modules for spatial and temporal analysis, and establishing a new benchmark.
Contribution
The paper proposes VA-GPT with Spatial Effective Token Selection and Temporal Effective Token Generation modules, and creates a new dataset and benchmark for video anomaly detection.
Findings
VA-GPT outperforms existing methods on benchmarks.
Effective token alignment improves anomaly localization.
New dataset enhances fine-tuning of video-anomaly models.
Abstract
Understanding abnormal events in videos is a vital and challenging task that has garnered significant attention in a wide range of applications. Although current video understanding Multi-modal Large Language Models (MLLMs) are capable of analyzing general videos, they often struggle to handle anomalies due to the spatial and temporal sparsity of abnormal events, where the redundant information always leads to suboptimal outcomes. To address these challenges, exploiting the representation and generalization capabilities of Vision Language Models (VLMs) and Large Language Models (LLMs), we propose VA-GPT, a novel MLLM designed for summarizing and localizing abnormal events in various videos. Our approach efficiently aligns effective tokens between visual encoders and LLMs through two key proposed modules: Spatial Effective Token Selection (SETS) and Temporal Effective Token Generation…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications
