Multi-Level LVLM Guidance for Untrimmed Video Action Recognition

Liyang Peng; Sihan Zhu; Yunjie Guo

arXiv:2508.17442 · cs.CV · August 26, 2025

TL;DR

This paper presents ECVT, a novel video transformer architecture that integrates multi-level semantic guidance from large vision-language models to improve untrimmed video action recognition and localization.

Contribution

The paper introduces ECVT, which combines a dual-branch design with multi-granularity semantic cues from LVLMs to improve the understanding of complex video actions.

Findings

01. Achieves state-of-the-art results on the ActivityNet v1.3 and THUMOS14 datasets.

02. Significantly improves action recognition accuracy with semantic guidance.

03. Demonstrates effective temporal structure and event logic modeling.

Abstract

Action recognition and localization in complex, untrimmed videos remain a formidable challenge in computer vision, largely due to the limitations of existing methods in capturing fine-grained actions, long-term temporal dependencies, and high-level semantic information from low-level visual features. This paper introduces the Event-Contextualized Video Transformer (ECVT), a novel architecture that leverages the advanced semantic understanding capabilities of Large Vision-Language Models (LVLMs) to bridge this gap. ECVT employs a dual-branch design, comprising a Video Encoding Branch for spatio-temporal feature extraction and a Cross-Modal Guidance Branch. The latter utilizes an LVLM to generate multi-granularity semantic descriptions, including Global Event Prompting for macro-level narrative and Temporal Sub-event Prompting for fine-grained action details. These multi-level textual…
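The two prompting levels named in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the prompt wording, the fixed-length windowing, the `SemanticGuidance` container, and the `describe` callable standing in for an LVLM are all assumptions made for illustration. The sketch only shows the idea of requesting one macro-level description for the whole video and one fine-grained description per temporal window.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class SemanticGuidance:
    """Hypothetical container for the two levels of text guidance."""
    global_event: str                            # macro-level narrative for the whole video
    sub_events: List[Tuple[float, float, str]]   # (start_s, end_s, description) per window


def split_into_windows(duration_s: float, window_s: float) -> List[Tuple[float, float]]:
    """Cut [0, duration_s] into consecutive windows; the last one may be shorter."""
    if duration_s <= 0 or window_s <= 0:
        raise ValueError("duration_s and window_s must be positive")
    windows = []
    start = 0.0
    while start < duration_s:
        end = min(start + window_s, duration_s)
        windows.append((start, end))
        start = end
    return windows


def build_guidance(
    duration_s: float,
    window_s: float,
    describe: Callable[[str, float, float], str],
) -> SemanticGuidance:
    """Query a describer once globally and once per temporal window.

    `describe(prompt, start_s, end_s)` is a stand-in for an LVLM call;
    the prompt strings below are assumed wording, not taken from the paper.
    """
    global_prompt = "Describe the overall event shown in this video."
    sub_prompt = "Describe the specific action happening in this segment."
    global_event = describe(global_prompt, 0.0, duration_s)
    sub_events = [
        (start, end, describe(sub_prompt, start, end))
        for start, end in split_into_windows(duration_s, window_s)
    ]
    return SemanticGuidance(global_event, sub_events)


def fake_lvlm(prompt: str, start_s: float, end_s: float) -> str:
    """Placeholder describer so the sketch runs without any model."""
    return f"[{start_s:.0f}-{end_s:.0f}s] {prompt}"


guidance = build_guidance(25.0, 10.0, fake_lvlm)
```

With a 25-second video and 10-second windows, `guidance.sub_events` holds three entries, the last covering 20 to 25 seconds. How these descriptions are encoded and combined with the video features is not covered by the truncated abstract, so the sketch stops at text generation.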
