Open-Vocabulary Action Localization with Iterative Visual Prompting

Naoki Wake; Atsushi Kanehira; Kazuhiro Sasabuchi; Jun Takamatsu; Katsushi Ikeuchi

arXiv:2408.17422·cs.CV·April 8, 2025

TL;DR

This paper introduces a training-free, open-vocabulary method for video action localization using vision-language models and iterative visual prompting, reducing annotation costs and achieving competitive zero-shot performance.

Contribution

The paper extends an iterative visual prompting technique so that off-the-shelf VLMs, which are not designed for long videos or for finding actions, can localize actions in long videos without any training.

Findings

01  Achieves zero-shot action localization results comparable to state-of-the-art methods.

02  Shows that iteratively narrowing the sampling window refines the estimated start and end times of an action.

03  Provides a practical, training-free approach to open-vocabulary video understanding.

Abstract

Video action localization aims to find the timings of specific actions from a long video. Although existing learning-based approaches have been successful, they require annotating videos, which comes with a considerable labor cost. This paper proposes a training-free, open-vocabulary approach based on emerging off-the-shelf vision-language models (VLMs). The challenge stems from the fact that VLMs are neither designed to process long videos nor tailored for finding actions. We overcome these problems by extending an iterative visual prompting technique. Specifically, we sample video frames and create a concatenated image with frame index labels, allowing a VLM to identify the frames that most likely correspond to the start and end of the action. By iteratively narrowing the sampling window around the selected frames, the estimation gradually converges to more precise temporal…
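
The narrowing loop described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the VLM call is replaced by a hypothetical `pick` callable that returns the index label of the selected frame, the tiled image with frame labels is not rendered, and the parameter values (`n_frames`, `shrink`, `iterations`) are illustrative choices that do not come from the paper.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for the VLM: given the timestamps of the sampled
# frames (which a real system would show as index labels on one tiled
# image), return the index of the frame it selects.
FramePicker = Callable[[List[float]], int]


def sample_times(center: float, window: float, n: int, duration: float) -> List[float]:
    """Evenly space n timestamps over a window around `center`, clipped to the video."""
    if n < 2:
        raise ValueError("need at least two frames per iteration")
    lo = max(0.0, center - window / 2)
    hi = min(duration, center + window / 2)
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]


def localize_boundary(duration: float, pick: FramePicker, n_frames: int = 16,
                      shrink: float = 0.5, iterations: int = 4) -> float:
    """Estimate one boundary (start or end) by repeatedly narrowing the window."""
    center, window = duration / 2, duration  # first pass covers the whole video
    for _ in range(iterations):
        times = sample_times(center, window, n_frames, duration)
        center = times[pick(times)]  # re-center on the selected frame
        window *= shrink             # assumed shrink factor, not from the paper
    return center


def localize_action(duration: float, pick_start: FramePicker,
                    pick_end: FramePicker, **kwargs) -> Tuple[float, float]:
    """Run the narrowing loop once for the start and once for the end."""
    return (localize_boundary(duration, pick_start, **kwargs),
            localize_boundary(duration, pick_end, **kwargs))


def nearest_to(target: float) -> FramePicker:
    """Toy picker for testing: selects the sampled frame closest to a known time."""
    return lambda times: min(range(len(times)), key=lambda i: abs(times[i] - target))


if __name__ == "__main__":
    start, end = localize_action(60.0, nearest_to(12.4), nearest_to(37.3))
    print(f"estimated action span: {start:.2f}s to {end:.2f}s")
```

In a real pipeline, `pick` would build the concatenated, index-labeled image from the frames at `times`, send it to a VLM together with the action description, and parse the returned frame index. Each iteration spends the same frame budget on a shorter window, which is why the estimate can become more precise without the model ever seeing the full video at high temporal resolution.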

Code & Models

Repositories

microsoft/VLM-Video-Action-Localization

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Hand Gesture Recognition Systems