VT-LVLM-AR: A Video-Temporal Large Vision-Language Model Adapter for Fine-Grained Action Recognition in Long-Term Videos
Kaining Li, Shuwei He, Zihan Xu

TL;DR
This paper introduces VT-LVLM-AR, a novel framework that converts long-term videos into semantic event sequences and employs a large vision-language model with prompt tuning for fine-grained action recognition, achieving state-of-the-art results.
Contribution
The paper presents a new Video-to-Event Mapper and adapts a frozen LVLM with prompt tuning for efficient, interpretable action recognition in long videos.
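As a rough illustration of what prompt tuning against a frozen model means in general, the sketch below trains only a few prompt vectors that are prepended to an event sequence, while the model weights stay fixed. This is not the paper's implementation: the linear scorer `FROZEN_W` stands in for the frozen LVLM, and the toy event embeddings, the squared-error loss, and all function names are assumptions made for the example.

```python
# Minimal prompt-tuning sketch (illustrative only, not the VT-LVLM-AR code).
# Assumption: a fixed linear scorer over mean-pooled tokens stands in for the frozen LVLM.

FROZEN_W = (0.5, -0.25, 1.0, 0.75)  # frozen "model" weights, never updated


def mean_pool(vectors):
    """Average a list of equal-length vectors, dimension by dimension."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def frozen_score(tokens):
    """Score a token sequence with the frozen linear stand-in model."""
    pooled = mean_pool(tokens)
    return sum(p * w for p, w in zip(pooled, FROZEN_W))


def prompt_tuning_step(prompt, events, target, lr):
    """One gradient step on the prompt vectors only, for a squared-error loss."""
    tokens = prompt + events              # learnable prompt prepended to the events
    err = frozen_score(tokens) - target
    n = len(tokens)
    # d(err^2)/d(prompt[k][i]) = 2 * err * FROZEN_W[i] / n
    new_prompt = [
        [p_i - lr * 2.0 * err * w_i / n for p_i, w_i in zip(vec, FROZEN_W)]
        for vec in prompt
    ]
    return new_prompt, err ** 2


# Hypothetical toy inputs: three "event" embeddings and two zero-initialised prompt vectors.
events = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
prompt = [[0.0] * 4 for _ in range(2)]

losses = []
for _ in range(50):
    prompt, loss = prompt_tuning_step(prompt, events, target=1.0, lr=1.0)
    losses.append(loss)
```

Only `prompt` changes during the loop, so the number of trained values is tiny compared with the model it steers, which is the efficiency argument usually made for prompt tuning.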
Findings
Achieves 94.1% accuracy on the NTU RGB+D cross-subject (X-Sub) benchmark.
Outperforms existing methods in fine-grained action recognition.
Demonstrates the effectiveness of visual event sequences and prompt tuning.
Abstract
Human action recognition in long-term videos, characterized by complex backgrounds and subtle action differences, poses significant challenges for traditional deep learning models due to computational overhead, difficulty in capturing long-range temporal dependencies, and limited semantic understanding. While Large Language Models (LLMs) and Large Vision-Language Models (LVLMs) have shown remarkable capabilities in multi-modal understanding and reasoning, their direct application to continuous video streams for fine-grained action recognition remains an open problem. This paper introduces VT-LVLM-AR (Video-Temporal Large Vision-Language Model Adapter for Action Recognition), a novel framework designed to bridge this gap. VT-LVLM-AR comprises a Video-to-Event Mapper (VTEM) that efficiently transforms raw video into compact, semantically rich, and temporally coherent "visual event…
