Open-Vocabulary Action Localization with Iterative Visual Prompting
Naoki Wake, Atsushi Kanehira, Kazuhiro Sasabuchi, Jun Takamatsu, Katsushi Ikeuchi

TL;DR
This paper introduces a training-free, open-vocabulary method for video action localization using vision-language models and iterative visual prompting, reducing annotation costs and achieving competitive zero-shot performance.
Contribution
It extends an iterative visual prompting technique so that off-the-shelf VLMs can localize actions in videos without training, working around the fact that these models are not designed to process long videos.
Findings
Achieves zero-shot action localization results comparable to state-of-the-art methods.
Demonstrates the effectiveness of iterative prompting in refining temporal boundaries.
Provides a practical, training-free approach for open-vocabulary video understanding.
Abstract
Video action localization aims to find the timings of specific actions from a long video. Although existing learning-based approaches have been successful, they require annotating videos, which comes with a considerable labor cost. This paper proposes a training-free, open-vocabulary approach based on emerging off-the-shelf vision-language models (VLMs). The challenge stems from the fact that VLMs are neither designed to process long videos nor tailored for finding actions. We overcome these problems by extending an iterative visual prompting technique. Specifically, we sample video frames and create a concatenated image with frame index labels, allowing a VLM to identify the frames that most likely correspond to the start and end of the action. By iteratively narrowing the sampling window around the selected frames, the estimation gradually converges to more precise temporal…
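The narrowing loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the sample count, the shrink factor, and the stopping rule are all assumptions made for illustration. The VLM call is replaced by a hypothetical `pick` callback that receives the sampled frame indices (the labels that would appear on the concatenated image) and returns the one it judges closest to the boundary.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for the VLM: given the sampled frame indices,
# return the index judged closest to the action boundary.
Picker = Callable[[List[int]], int]


def sample_indices(lo: int, hi: int, n: int) -> List[int]:
    """Return up to n evenly spaced frame indices in [lo, hi], inclusive."""
    if hi <= lo:
        return [lo]
    n = min(n, hi - lo + 1)
    if n == 1:
        return [lo]
    step = (hi - lo) / (n - 1)
    return sorted({lo + round(i * step) for i in range(n)})


def refine_boundary(num_frames: int, pick: Picker, n_samples: int = 8,
                    shrink: float = 0.5, max_iters: int = 8) -> int:
    """Iteratively narrow the sampling window around the picked frame.

    n_samples, shrink, and max_iters are assumed values, not taken
    from the paper.
    """
    lo, hi = 0, num_frames - 1
    best = lo
    for _ in range(max_iters):
        candidates = sample_indices(lo, hi, n_samples)
        best = pick(candidates)
        width = hi - lo
        if width <= n_samples - 1:
            # Every frame in the window was shown; nothing left to refine.
            break
        # Shrink the window and re-center it on the picked frame,
        # clamped to the bounds of the video.
        half = max(1, int(width * shrink / 2))
        lo = max(0, best - half)
        hi = min(num_frames - 1, best + half)
    return best


def localize_action(num_frames: int, pick_start: Picker,
                    pick_end: Picker) -> Tuple[int, int]:
    """Estimate (start, end) frames by refining each boundary separately."""
    start = refine_boundary(num_frames, pick_start)
    end = refine_boundary(num_frames, pick_end)
    return min(start, end), max(start, end)
```

In a real pipeline, `pick` would render the sampled frames into one labeled image, send it to a VLM together with the action description, and parse the returned frame label. For a quick check without a model, a picker such as `lambda c: min(c, key=lambda i: abs(i - 337))` simulates a VLM that always chooses the sampled frame nearest to a known boundary.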
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Hand Gesture Recognition Systems
