Unsupervised Pre-training for Temporal Action Localization Tasks
Can Zhang, Tianyu Yang, Junwu Weng, Meng Cao, Jue Wang, Yuexian Zou

TL;DR
This paper introduces a self-supervised pretext task called Pseudo Action Localization (PAL) that improves unsupervised video representation learning for temporal action localization by aligning the features of pseudo action regions pasted into two synthetic videos.
Contribution
The authors present what they describe as the first attempt at a self-supervised pre-training method tailored for temporal action localization, bridging the gap between video-level classification and clip-level localization.
Findings
PAL significantly boosts temporal action localization (TAL) performance when pre-training on large-scale unlabeled video data.
The method introduces a temporal equivariant contrastive learning paradigm.
Extensive experiments validate PAL's effectiveness over existing approaches.
Abstract
Unsupervised video representation learning has made remarkable achievements in recent years. However, most existing methods are designed and optimized for video classification. These pre-trained models can be sub-optimal for temporal localization tasks due to the inherent discrepancy between video-level classification and clip-level localization. To bridge this gap, we make the first attempt to propose a self-supervised pretext task, coined as Pseudo Action Localization (PAL), to Unsupervisedly Pre-train feature encoders for Temporal Action Localization tasks (UP-TAL). Specifically, we first randomly select temporal regions, each of which contains multiple clips, from one video as pseudo actions and then paste them onto different temporal positions of the other two videos. The pretext task is to align the features of pasted pseudo action regions from two synthetic videos and maximize the…
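The cut-and-paste step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (paste_pseudo_action, make_synthetic_pair, region_feature), the representation of a video as a NumPy array of per-clip feature vectors, and the mean pooling over a region are all assumptions made for illustration. The training objective itself is left out because the abstract is truncated before it is stated.

```python
import numpy as np


def paste_pseudo_action(source, background, src_start, length, dst_start):
    """Return a copy of `background` with `length` clips of `source` pasted in.

    Both inputs are arrays of shape (num_clips, feature_dim). This layout is an
    assumption for the sketch, not the paper's data format.
    """
    if src_start < 0 or src_start + length > len(source):
        raise ValueError("pseudo action region falls outside the source video")
    if dst_start < 0 or dst_start + length > len(background):
        raise ValueError("paste position falls outside the background video")
    out = background.copy()
    out[dst_start:dst_start + length] = source[src_start:src_start + length]
    return out


def make_synthetic_pair(source, bg_a, bg_b, length, rng):
    """Paste one random region of `source` at random positions in two videos.

    Returns both synthetic videos and the (start, end) span of the pasted
    region in each, which acts as the pseudo localization label.
    """
    src_start = int(rng.integers(0, len(source) - length + 1))
    dst_a = int(rng.integers(0, len(bg_a) - length + 1))
    dst_b = int(rng.integers(0, len(bg_b) - length + 1))
    syn_a = paste_pseudo_action(source, bg_a, src_start, length, dst_a)
    syn_b = paste_pseudo_action(source, bg_b, src_start, length, dst_b)
    return syn_a, syn_b, (dst_a, dst_a + length), (dst_b, dst_b + length)


def region_feature(clip_features, span):
    """Mean-pool clip features over a temporal span (hypothetical pooling)."""
    start, end = span
    return clip_features[start:end].mean(axis=0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    source = rng.normal(size=(16, 8))   # video supplying the pseudo action
    bg_a = rng.normal(size=(32, 8))     # first background video
    bg_b = rng.normal(size=(24, 8))     # second background video
    syn_a, syn_b, span_a, span_b = make_synthetic_pair(
        source, bg_a, bg_b, length=4, rng=rng
    )
    # The pasted regions hold the same clips, so their pooled features match
    # here; a learned encoder would have to be trained to align them.
    print(span_a, span_b)
    print(np.allclose(region_feature(syn_a, span_a), region_feature(syn_b, span_b)))
```

In this toy setting the clips already are feature vectors, so the two pooled region features match trivially. With a real encoder, each clip's feature would also depend on the surrounding background, which is why aligning the two pasted regions is a non-trivial pretext task.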
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Advanced Vision and Imaging
Methods: Contrastive Learning · ALIGN
