OZ-TAL: Online Zero-Shot Temporal Action Localization
Chaolei Han, Hongsong Wang, Xin Gong, Jie Gui

TL;DR
This paper introduces OZ-TAL, a new task for detecting previously unseen actions in streaming videos in an online fashion, together with a training-free framework built on off-the-shelf vision-language models and new benchmarks for the task.
Contribution
It presents a training-free, online zero-shot action localization method using off-the-shelf VLMs, with mechanisms to improve visual features and reduce bias, plus new benchmarks.
Findings
Outperforms state-of-the-art methods in zero-shot online action detection.
Effective in both offline and online zero-shot settings.
Establishes new benchmarks on THUMOS14 and ActivityNet-1.3.
Abstract
Online Temporal Action Localization (On-TAL) aims to detect the occurrence time and category of actions in untrimmed streaming videos immediately upon their completion. Recent advancements in this field focus on developing more sophisticated frameworks, shifting from the Online Action Detection (OAD)-based aggregation paradigm to instance-level understanding. However, existing approaches are typically trained on specific domains and often generalize poorly to arbitrary videos, particularly when those videos contain previously unseen actions. In this paper, we introduce a new task called Online Zero-shot Temporal Action Localization (OZ-TAL), which aims to detect previously unseen actions in an online fashion. Furthermore, we propose a training-free framework that leverages off-the-shelf Vision-Language Models (VLMs) while introducing additional mechanisms to…
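To make the task setting concrete, here is a minimal sketch of the simpler OAD-style aggregation idea the abstract mentions, combined with zero-shot per-frame labeling. It is not the authors' framework. It assumes frame and class-name embeddings already come from some VLM's image and text encoders (replaced here by toy 2-D vectors), and every name in it (`OnlineLocalizer`, `step`, `threshold`) is hypothetical. Each incoming frame is assigned the class whose text embedding is most similar, or background if no similarity clears the threshold; an action instance is emitted as soon as its run of frames ends.

```python
import numpy as np


def cosine_scores(frame_emb, text_embs):
    """Cosine similarity between one frame embedding and each class embedding."""
    f = frame_emb / np.linalg.norm(frame_emb)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    return t @ f


class OnlineLocalizer:
    """Hypothetical online aggregator: groups per-frame labels into instances."""

    def __init__(self, text_embs, labels, threshold=0.9):
        self.text_embs = np.asarray(text_embs, dtype=float)
        self.labels = list(labels)
        self.threshold = threshold  # assumed value; below it a frame is background
        self.current = None  # (label, start_index) of the open instance, if any

    def step(self, t, frame_emb):
        """Consume frame t; return a finished (label, start, end) or None."""
        scores = cosine_scores(np.asarray(frame_emb, dtype=float), self.text_embs)
        k = int(np.argmax(scores))
        label = self.labels[k] if scores[k] >= self.threshold else None

        finished = None
        # Close the open instance as soon as the label changes or drops to background.
        if self.current is not None and label != self.current[0]:
            finished = (self.current[0], self.current[1], t - 1)
            self.current = None
        # Open a new instance when an action label appears and none is open.
        if label is not None and self.current is None:
            self.current = (label, t)
        return finished


# Toy stand-ins for VLM embeddings: two unseen classes and a 7-frame stream.
text_embs = [[1.0, 0.0], [0.0, 1.0]]
labels = ["jump", "run"]
stream = [
    [0.7, 0.7],  # background
    [1.0, 0.0],  # jump
    [1.0, 0.1],  # jump
    [0.7, 0.7],  # background
    [0.0, 1.0],  # run
    [0.0, 1.0],  # run
    [0.7, 0.7],  # background
]

localizer = OnlineLocalizer(text_embs, labels)
detections = []
for t, frame in enumerate(stream):
    out = localizer.step(t, frame)
    if out is not None:
        detections.append(out)

print(detections)  # [('jump', 1, 2), ('run', 4, 5)]
```

Each instance is reported on the first frame after it ends, which mirrors the "immediately upon their completion" requirement. The class list can be swapped at inference time without retraining, which is what makes the setting zero-shot.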
