VT-LVLM-AR: A Video-Temporal Large Vision-Language Model Adapter for Fine-Grained Action Recognition in Long-Term Videos

Kaining Li; Shuwei He; Zihan Xu

arXiv:2508.15903 · cs.CV · August 25, 2025


TL;DR

This paper introduces VT-LVLM-AR, a novel framework that converts long-term videos into semantic event sequences and employs a large vision-language model with prompt tuning for fine-grained action recognition, achieving state-of-the-art results.

Contribution

The paper presents a new Video-to-Event Mapper and adapts a frozen LVLM with prompt tuning for efficient, interpretable action recognition in long videos.
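To make the two ideas concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the paper's implementation: this summary does not describe the mapper's internals, so `frames_to_events` uses uniform segment mean-pooling purely as a stand-in, and `prepend_soft_prompt` only shows the usual prompt-tuning input layout (trainable prompt vectors placed in front of the event sequence while the backbone stays frozen). All function names, shapes, and values are hypothetical.

```python
from typing import List

Vector = List[float]


def frames_to_events(frame_feats: List[Vector], num_events: int) -> List[Vector]:
    """Hypothetical stand-in for a video-to-event mapper.

    Splits per-frame feature vectors into `num_events` contiguous segments
    and mean-pools each segment, giving a shorter, temporally ordered sequence.
    """
    if not frame_feats:
        raise ValueError("frame_feats must not be empty")
    n = len(frame_feats)
    if not 1 <= num_events <= n:
        raise ValueError("num_events must be between 1 and the number of frames")
    dim = len(frame_feats[0])
    events: List[Vector] = []
    for k in range(num_events):
        # Integer boundaries keep segments contiguous and non-empty.
        start = k * n // num_events
        end = (k + 1) * n // num_events
        segment = frame_feats[start:end]
        events.append([sum(f[d] for f in segment) / len(segment) for d in range(dim)])
    return events


def prepend_soft_prompt(prompt: List[Vector], events: List[Vector]) -> List[Vector]:
    """Build the model input [prompt; events].

    In prompt tuning, only the `prompt` vectors would be updated during
    training; the (assumed) backbone consuming this sequence stays frozen.
    """
    return [list(p) for p in prompt] + [list(e) for e in events]


# Usage: four frames with 2-d features, pooled into two events.
frames = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0], [6.0, 7.0]]
events = frames_to_events(frames, num_events=2)
model_input = prepend_soft_prompt([[0.0, 0.0]], events)
```

The point of the layout is that the long frame sequence is shortened before it reaches the language model, and adaptation touches only a handful of prompt vectors rather than the model weights.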

Findings

01. Achieves 94.1% accuracy on the NTU RGB+D cross-subject (X-Sub) benchmark.

02. Outperforms existing methods in fine-grained action recognition.

03. Demonstrates the effectiveness of visual event sequences and prompt tuning.

Abstract

Human action recognition in long-term videos, characterized by complex backgrounds and subtle action differences, poses significant challenges for traditional deep learning models due to computational overhead, difficulty in capturing long-range temporal dependencies, and limited semantic understanding. While Large Language Models (LLMs) and Large Vision-Language Models (LVLMs) have shown remarkable capabilities in multi-modal understanding and reasoning, their direct application to continuous video streams for fine-grained action recognition remains an open problem. This paper introduces VT-LVLM-AR (Video-Temporal Large Vision-Language Model Adapter for Action Recognition), a novel framework designed to bridge this gap. VT-LVLM-AR comprises a Video-to-Event Mapper (VTEM) that efficiently transforms raw video into compact, semantically rich, and temporally coherent "visual event…
