ZEETAD: Adapting Pretrained Vision-Language Model for Zero-Shot End-to-End Temporal Action Detection

Thinh Phan; Khoa Vo; Duy Le; Gianfranco Doretto; Donald Adjeroh; Ngan Le

arXiv:2311.00729 · cs.CV · November 8, 2023 · 1 citation

TL;DR

ZEETAD introduces a novel approach for zero-shot temporal action detection by combining dual-localization and CLIP-based classification modules, effectively leveraging pretrained vision-language models to detect and recognize unseen actions in videos.

Contribution

The paper proposes ZEETAD, a new framework that enhances zero-shot TAD by integrating Transformer-based localization with CLIP-based semantic classification and minimal model updates.

Findings

1. Outperforms existing zero-shot TAD methods on the THUMOS14 and ActivityNet-1.3 datasets.

2. Effectively transfers knowledge from vision-language models to unseen action categories.

3. Demonstrates the importance of dual-localization and lightweight adaptation for zero-shot video understanding.

Abstract

Temporal action detection (TAD) involves the localization and classification of action instances within untrimmed videos. While standard TAD follows fully supervised learning in a closed-set setting on large training data, recent zero-shot TAD methods showcase a promising open-set setting by leveraging large-scale contrastive visual-language (ViL) pretrained models. However, existing zero-shot TAD methods are limited in how they build a strong relationship between the two interdependent tasks of localization and classification, and in how they adapt the ViL model to video understanding. In this work, we present ZEETAD, featuring two modules: dual-localization and zero-shot proposal classification. The former is a Transformer-based module that detects action events while selectively collecting crucial semantic embeddings for later recognition. The latter, a CLIP-based module, generates…
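To make the idea of CLIP-based zero-shot proposal classification concrete, here is a minimal sketch of the general technique: each proposal embedding is compared with the text embeddings of the candidate class names by cosine similarity, followed by a softmax. This is not the ZEETAD implementation. The function names, the random stand-in embeddings, and the temperature value of 0.07 are all assumptions made for illustration; a real pipeline would take the embeddings from a pretrained vision-language model.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    # Scale each vector to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def zero_shot_classify(proposal_emb, text_emb, temperature=0.07):
    """Hypothetical helper: score proposals against class-name text embeddings.

    proposal_emb: (P, D) array, one embedding per action proposal.
    text_emb:     (C, D) array, one embedding per candidate class name.
    Returns a (P, C) array of class probabilities.
    """
    p = l2_normalize(proposal_emb)
    t = l2_normalize(text_emb)
    logits = (p @ t.T) / temperature             # temperature-scaled cosine similarity
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)

# Stand-in data (assumption): random vectors play the role of text embeddings,
# and each proposal is a slightly perturbed copy of one class embedding.
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(3, 16))
proposal_emb = text_emb[[2, 0]] + 0.05 * rng.normal(size=(2, 16))

probs = zero_shot_classify(proposal_emb, text_emb)
pred = probs.argmax(axis=1)
print(pred)
```

Because the class set enters only through `text_emb`, unseen categories can be scored at test time by embedding their names, with no change to the classifier itself.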

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Anomaly Detection Techniques and Applications

Methods: Contrastive Language-Image Pre-training