Zero-shot Action Localization via the Confidence of Large Vision-Language Models

Josiah Aklilu; Xiaohan Wang; Serena Yeung-Levy

arXiv:2410.14340·cs.CV·March 26, 2025

Open Access

TL;DR

This paper introduces ZEAL, a zero-shot action localization method that uses a large language model to generate detailed action descriptions and a large vision-language model to localize those actions at the frame level, without training on task-specific datasets.

Contribution

The paper presents a novel zero-shot localization approach that uses large language models to create detailed action descriptions for effective video localization.

Findings

01

Achieves strong zero-shot localization performance on challenging benchmarks.

02

Does not require any training data for localization.

03

Demonstrates flexibility with different large vision-language models.

Abstract

Precise action localization in untrimmed video is vital for fields such as professional sports and minimally invasive surgery, where the delineation of particular motions in recordings can dramatically enhance analysis. But in many cases, large-scale datasets with video-label pairs for localization are unavailable, limiting the opportunity to fine-tune video-understanding models. Recent developments in large vision-language models (LVLMs) address this need with impressive zero-shot capabilities in a variety of video understanding tasks. However, the adaptation of LVLMs, with their powerful visual question answering capabilities, to zero-shot localization in long-form video is still relatively unexplored. To this end, we introduce a true Zero-shot Action Localization method (ZEAL). Specifically, we leverage the built-in action knowledge of a large language model (LLM) to inflate actions…

Taxonomy

Topics: Human Pose and Action Recognition · Anomaly Detection Techniques and Applications · Multimodal Machine Learning Applications