Zero-shot Action Localization via the Confidence of Large Vision-Language Models
Josiah Aklilu, Xiaohan Wang, Serena Yeung-Levy

TL;DR
This paper introduces ZEAL, a zero-shot action localization method that uses a large language model to generate detailed action descriptions and a large vision-language model to localize actions at the frame level, without training on task-specific datasets.
Contribution
The paper presents a novel zero-shot localization approach that uses large language models to create detailed action descriptions for effective video localization.
Findings
Achieves strong zero-shot localization performance on challenging benchmarks.
Does not require any training data for localization.
Demonstrates flexibility with different large vision-language models.
Abstract
Precise action localization in untrimmed video is vital for fields such as professional sports and minimally invasive surgery, where the delineation of particular motions in recordings can dramatically enhance analysis. But in many cases, large-scale datasets with video-label pairs for localization are unavailable, limiting the opportunity to fine-tune video-understanding models. Recent developments in large vision-language models (LVLMs) address this need with impressive zero-shot capabilities in a variety of video understanding tasks. However, the adaptation of LVLMs, with their powerful visual question answering capabilities, to zero-shot localization in long-form video is still relatively unexplored. To this end, we introduce a true Zero-shot Action Localization method (ZEAL). Specifically, we leverage the built-in action knowledge of a large language model (LLM) to inflate actions…
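To make the idea of frame-level localization from model confidence concrete, here is a minimal sketch. It is not the paper's procedure, since the abstract above is cut off before the details. It assumes a hypothetical list of per-frame confidence scores in [0, 1] (for example, how confident an LVLM is that the action is visible in a frame) and simply groups consecutive frames at or above a threshold into spans. The function name, the threshold value, and the scores are illustrative assumptions.

```python
from typing import List, Tuple


def localize_from_confidence(
    scores: List[float], threshold: float = 0.5
) -> List[Tuple[int, int]]:
    """Group consecutive frames whose score meets the threshold.

    `scores` is a hypothetical per-frame confidence list; the returned
    spans are (start_frame, end_frame) index pairs, both inclusive.
    """
    spans: List[Tuple[int, int]] = []
    start = None  # index where the current span began, if any
    for i, score in enumerate(scores):
        if score >= threshold and start is None:
            start = i  # a new span opens at this frame
        elif score < threshold and start is not None:
            spans.append((start, i - 1))  # span closed on the previous frame
            start = None
    if start is not None:
        # The last span runs to the final frame of the video.
        spans.append((start, len(scores) - 1))
    return spans


# Illustrative scores only; they do not come from any model or dataset.
example_scores = [0.1, 0.7, 0.9, 0.2, 0.6]
example_spans = localize_from_confidence(example_scores, threshold=0.5)
```

With these made-up scores, frames 1 to 2 and frame 4 are returned as two separate spans. A real system would also need to map frame indices to timestamps and decide how to handle noisy scores, which this sketch leaves out.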
Taxonomy
Topics: Human Pose and Action Recognition · Anomaly Detection Techniques and Applications · Multimodal Machine Learning Applications
