Exploring Scalability of Self-Training for Open-Vocabulary Temporal Action Localization
Jeongseok Hyun, Su Ho Han, Hyolim Kang, Joon-Young Lee, Seon Joo Kim

TL;DR
This paper investigates the use of self-training with unlabeled web videos to improve open-vocabulary temporal action localization, demonstrating enhanced generalizability and proposing a new evaluation benchmark.
Contribution
The paper introduces a scalable self-training framework for OV-TAL that trains on pseudo-labels generated from unlabeled videos, and proposes a new benchmark for more comprehensive evaluation.
Findings
Self-training with web videos improves OV-TAL performance.
The new benchmark reveals limitations of existing evaluation schemes.
Gemini-1.5 achieves competitive results on the new benchmark.
Abstract
The vocabulary size in temporal action localization (TAL) is limited by the scarcity of large-scale annotated datasets. To overcome this, recent works integrate vision-language models (VLMs), such as CLIP, for open-vocabulary TAL (OV-TAL). However, despite the success of VLMs trained on extensive datasets, existing OV-TAL methods still rely on human-labeled TAL datasets of limited size to train action localizers, limiting their generalizability. In this paper, we explore the scalability of self-training with unlabeled YouTube videos for OV-TAL. Our approach consists of two stages: (1) a class-agnostic action localizer is trained on a human-labeled TAL dataset to generate pseudo-labels for unlabeled videos, and (2) the large-scale pseudo-labeled dataset is then used to train the localizer. Extensive experiments demonstrate that leveraging web-scale videos in self-training significantly…
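The two-stage procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Segment` type, the `train_fn` callback, the `generate_pseudo_labels` and `self_train` helpers, and the `min_score` confidence filter are all hypothetical names and choices introduced here. The abstract does not say how pseudo-labels are filtered or whether stage 2 starts from a fresh model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Segment:
    start: float  # segment start, in seconds
    end: float    # segment end, in seconds
    score: float  # class-agnostic "actionness" confidence in [0, 1]


# Hypothetical interfaces: a localizer maps a video id to candidate segments,
# and a training function maps {video id: segments} to a localizer.
Localizer = Callable[[str], List[Segment]]
TrainFn = Callable[[Dict[str, List[Segment]]], Localizer]


def generate_pseudo_labels(
    localizer: Localizer,
    unlabeled_videos: List[str],
    min_score: float = 0.5,  # assumed filter; not specified in the abstract
) -> Dict[str, List[Segment]]:
    """Run the localizer on unlabeled videos and keep confident, valid segments."""
    pseudo: Dict[str, List[Segment]] = {}
    for video_id in unlabeled_videos:
        kept = [
            seg for seg in localizer(video_id)
            if seg.score >= min_score and seg.end > seg.start
        ]
        if kept:  # skip videos with no confident segment
            pseudo[video_id] = kept
    return pseudo


def self_train(
    train_fn: TrainFn,
    labeled: Dict[str, List[Segment]],
    unlabeled_videos: List[str],
    min_score: float = 0.5,
) -> Tuple[Localizer, Dict[str, List[Segment]]]:
    # Stage 1: train a class-agnostic localizer on the human-labeled set
    # and use it to pseudo-label the unlabeled videos.
    teacher = train_fn(labeled)
    pseudo = generate_pseudo_labels(teacher, unlabeled_videos, min_score)
    # Stage 2: train the localizer on the (much larger) pseudo-labeled set.
    student = train_fn(pseudo)
    return student, pseudo
```

Because the localizer is class-agnostic, the pseudo-labels carry only temporal boundaries and a confidence score; in an open-vocabulary setup, class names would be assigned separately, for example by a VLM such as CLIP.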
Taxonomy
Topics: Speech and dialogue systems · Natural Language Processing Techniques · Intelligent Tutoring Systems and Adaptive Learning
Methods: Contrastive Language-Image Pre-training
