Training-Free Zero-Shot Temporal Action Detection with Vision-Language Models

Chaolei Han; Hongsong Wang; Jidong Kuang; Lei Zhang; Jie Gui

arXiv:2501.13795 · cs.CV · May 19, 2026

TL;DR

This paper introduces FreeZAD, a training-free zero-shot temporal action detection method built on existing vision-language models. It outperforms state-of-the-art unsupervised methods on THUMOS14 and ActivityNet-1.3 while requiring only 1/13 of the runtime of comparable methods.

Contribution

The paper proposes a training-free approach to zero-shot temporal action detection that applies existing vision-language models directly, and introduces the LogOIC score, frequency-based Actionness Calibration, and a test-time adaptation (TTA) strategy.

Findings

01 Outperforms state-of-the-art unsupervised methods on THUMOS14 and ActivityNet-1.3.

02 Requires only 1/13 of the runtime of comparable methods.

03 Test-time adaptation improves detection performance significantly.

Abstract

Existing zero-shot temporal action detection (ZSTAD) methods predominantly use fully supervised or unsupervised strategies to recognize unseen activities. However, these training-based methods are prone to domain shifts and require high computational costs, which hinder their practical applicability in real-world scenarios. In this paper, unlike previous works, we propose a training-Free Zero-shot temporal Action Detection (FreeZAD) method, leveraging existing vision-language (ViL) models to directly classify and localize unseen activities within untrimmed videos without any additional fine-tuning or adaptation. We mitigate the need for explicit temporal modeling and reliance on pseudo-label quality by designing the LOGarithmic decay weighted Outer-Inner-Contrastive Score (LogOIC) and frequency-based Actionness Calibration. Furthermore, we introduce a test-time adaptation (TTA) strategy…
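
The abstract names LogOIC but does not give its formula, so the following is only a minimal sketch of the general idea, not the authors' implementation. It computes a plain outer-inner contrast for a candidate segment (mean per-frame score inside the segment minus mean score in a margin around it) and optionally down-weights outer frames by their distance from the segment boundary. The function name `oic_score`, the `inflate` margin ratio, and the weight `1 / log2(1 + d)` are all assumptions chosen for illustration; the per-frame scores are assumed to come from a vision-language model's frame-to-text similarity.

```python
import math


def oic_score(scores, start, end, inflate=0.25, log_decay=False):
    """Outer-inner contrast of the segment [start, end) over per-frame scores.

    All names and the weighting scheme are illustrative assumptions,
    not the formula used in the paper.
    """
    if not 0 <= start < end <= len(scores):
        raise ValueError("segment must be non-empty and within the score sequence")

    # Width of the outer margin on each side, proportional to segment length.
    margin = max(1, int(round(inflate * (end - start))))
    lo = max(0, start - margin)
    hi = min(len(scores), end + margin)

    inner = scores[start:end]
    inner_mean = sum(inner) / len(inner)

    # Pair each outer frame with its distance d >= 1 from the nearest boundary.
    outer = [(scores[t], start - t) for t in range(lo, start)]
    outer += [(scores[t], t - end + 1) for t in range(end, hi)]
    if not outer:
        # Segment spans the whole video: nothing to contrast against.
        return inner_mean

    if log_decay:
        # Assumed logarithmic decay: frames nearest the boundary weigh most.
        weights = [1.0 / math.log2(1.0 + d) for _, d in outer]
    else:
        weights = [1.0] * len(outer)

    outer_mean = sum(w * s for w, (s, _) in zip(weights, outer)) / sum(weights)
    return inner_mean - outer_mean


# Toy per-frame scores with a clear action in frames 2..5.
frame_scores = [0.1, 0.1, 0.9, 0.9, 0.9, 0.9, 0.1, 0.1]
aligned = oic_score(frame_scores, 2, 6, log_decay=True)
shifted = oic_score(frame_scores, 1, 5, log_decay=True)
print(aligned > shifted)
```

In this toy case the well-aligned segment scores higher than the shifted one, which is the property that lets a contrast score rank candidate segments without any training.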

Code & Models

Repositories

Chaolei98/FreeZAD (PyTorch, official)

Taxonomy

Topics: Human Pose and Action Recognition · Anomaly Detection Techniques and Applications · Video Surveillance and Tracking Methods