ZEETAD: Adapting Pretrained Vision-Language Model for Zero-Shot End-to-End Temporal Action Detection
Thinh Phan, Khoa Vo, Duy Le, Gianfranco Doretto, Donald Adjeroh, Ngan Le

TL;DR
ZEETAD introduces a novel approach for zero-shot temporal action detection by combining dual-localization and CLIP-based classification modules, effectively leveraging pretrained vision-language models to detect and recognize unseen actions in videos.
Contribution
The paper proposes ZEETAD, a new framework that enhances zero-shot TAD by integrating Transformer-based localization with CLIP-based semantic classification and minimal model updates.
Findings
Outperforms existing zero-shot TAD methods on THUMOS14 and ActivityNet-1.3 datasets.
Effectively transfers knowledge from vision-language models to unseen action categories.
Demonstrates the importance of dual-localization and lightweight adaptation for zero-shot video understanding.
Abstract
Temporal action detection (TAD) involves the localization and classification of action instances within untrimmed videos. While standard TAD follows fully supervised learning with closed-set setting on large training data, recent zero-shot TAD methods showcase the promising open-set setting by leveraging large-scale contrastive visual-language (ViL) pretrained models. However, existing zero-shot TAD methods have limitations on how to properly construct the strong relationship between two interdependent tasks of localization and classification and adapt ViL model to video understanding. In this work, we present ZEETAD, featuring two modules: dual-localization and zero-shot proposal classification. The former is a Transformer-based module that detects action events while selectively collecting crucial semantic embeddings for later recognition. The latter one, CLIP-based module, generates…
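The zero-shot proposal classification idea described above can be illustrated with a minimal sketch. This is not the authors' implementation. It only shows the general CLIP-style recipe of scoring a proposal embedding against text embeddings of class names by cosine similarity. The function names, the toy 3-dimensional embeddings, and the temperature value of 0.07 are all assumptions made for illustration; in practice the embeddings would come from pretrained CLIP encoders.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify_proposal(proposal_embedding, class_text_embeddings, temperature=0.07):
    """Assign a class to one action proposal (hypothetical helper).

    proposal_embedding: visual embedding of a localized segment.
    class_text_embeddings: dict mapping class name -> text embedding.
    temperature: assumed scaling factor, not taken from the paper.
    """
    names = list(class_text_embeddings)
    logits = [
        cosine_similarity(proposal_embedding, class_text_embeddings[n]) / temperature
        for n in names
    ]
    probs = softmax(logits)
    best = max(range(len(names)), key=probs.__getitem__)
    return names[best], dict(zip(names, probs))


# Toy embeddings standing in for encoder outputs (made up for this example).
text_embeddings = {
    "high jump": [1.0, 0.0, 0.0],
    "pole vault": [0.0, 1.0, 0.0],
}
proposal = [0.9, 0.1, 0.0]
label, probs = classify_proposal(proposal, text_embeddings)
print(label, probs)
```

Because the class set enters only through the text embeddings, unseen action categories can be added at inference time by supplying new class names, which is what makes this kind of classifier suitable for the open-set setting.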
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Anomaly Detection Techniques and Applications
Methods: Contrastive Language-Image Pre-training
