Multi-Level LVLM Guidance for Untrimmed Video Action Recognition
Liyang Peng, Sihan Zhu, Yunjie Guo

TL;DR
This paper presents ECVT, a novel video transformer architecture that integrates multi-level semantic guidance from large vision-language models to improve untrimmed video action recognition and localization.
Contribution
ECVT combines a dual-branch design with multi-granularity semantic cues from LVLMs to better model complex actions in untrimmed video.
Findings
Achieves state-of-the-art results on ActivityNet v1.3 and THUMOS14 datasets.
Significantly improves action recognition accuracy with semantic guidance.
Demonstrates effective modeling of temporal structure and event logic.
Abstract
Action recognition and localization in complex, untrimmed videos remain a formidable challenge in computer vision, largely due to the limitations of existing methods in capturing fine-grained actions, long-term temporal dependencies, and high-level semantic information from low-level visual features. This paper introduces the Event-Contextualized Video Transformer (ECVT), a novel architecture that leverages the advanced semantic understanding capabilities of Large Vision-Language Models (LVLMs) to bridge this gap. ECVT employs a dual-branch design, comprising a Video Encoding Branch for spatio-temporal feature extraction and a Cross-Modal Guidance Branch. The latter utilizes an LVLM to generate multi-granularity semantic descriptions, including Global Event Prompting for macro-level narrative and Temporal Sub-event Prompting for fine-grained action details. These multi-level textual…
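The two prompting levels named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the fixed-length windowing, the prompt wording, and every name below (`PromptRequest`, `split_into_windows`, `build_prompts`, `describe`, `fake_lvlm`) are assumptions made for illustration, and the LVLM is represented by an arbitrary callable rather than a real model API.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Span = Tuple[float, float]  # (start_sec, end_sec)


@dataclass
class PromptRequest:
    """One request to the LVLM (hypothetical structure, not from the paper)."""
    level: str  # "global" or "sub_event"
    span: Span  # portion of the video the prompt refers to
    text: str   # instruction sent to the LVLM (wording is an assumption)


def split_into_windows(duration: float, window: float) -> List[Span]:
    """Cut [0, duration] into consecutive windows; the last may be shorter."""
    if duration <= 0 or window <= 0:
        raise ValueError("duration and window must be positive")
    spans: List[Span] = []
    start = 0.0
    while start < duration:
        end = min(start + window, duration)
        spans.append((start, end))
        start = end
    return spans


def build_prompts(duration: float, window: float) -> List[PromptRequest]:
    """Build one global-event prompt plus one sub-event prompt per window."""
    requests = [
        PromptRequest(
            level="global",
            span=(0.0, duration),
            text="Describe the overall event in this video in one sentence.",
        )
    ]
    for start, end in split_into_windows(duration, window):
        requests.append(
            PromptRequest(
                level="sub_event",
                span=(start, end),
                text=f"Describe the action between {start:.1f}s and {end:.1f}s.",
            )
        )
    return requests


def describe(
    requests: List[PromptRequest],
    lvlm: Callable[[PromptRequest], str],
) -> List[Tuple[PromptRequest, str]]:
    """Pair each request with the description returned by the LVLM callable."""
    return [(req, lvlm(req)) for req in requests]


if __name__ == "__main__":
    # Stand-in for a real LVLM call; it only echoes the request metadata.
    def fake_lvlm(req: PromptRequest) -> str:
        return f"<{req.level} description for {req.span}>"

    for req, text in describe(build_prompts(10.0, 4.0), fake_lvlm):
        print(req.level, req.span, text)
```

In this sketch the global prompt covers the whole video while each sub-event prompt covers one window, which mirrors the macro-level versus fine-grained split described above. How ECVT actually segments the video and fuses the resulting text with visual features is not specified in the visible part of the abstract.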
