Exploring Scalability of Self-Training for Open-Vocabulary Temporal Action Localization

Jeongseok Hyun; Su Ho Han; Hyolim Kang; Joon-Young Lee; Seon Joo Kim

arXiv:2407.07024·cs.CV·December 20, 2024

TL;DR

This paper investigates the use of self-training with unlabeled web videos to improve open-vocabulary temporal action localization, demonstrating enhanced generalizability and proposing a new evaluation benchmark.

Contribution

The paper introduces a scalable self-training framework for OV-TAL that uses pseudo-labels generated from unlabeled videos, and proposes a new benchmark for comprehensive evaluation.

Findings

1. Self-training with web videos improves OV-TAL performance.
2. The new benchmark reveals limitations of existing evaluation schemes.
3. Gemini-1.5 achieves competitive results on the new benchmark.

Abstract

The vocabulary size in temporal action localization (TAL) is limited by the scarcity of large-scale annotated datasets. To overcome this, recent works integrate vision-language models (VLMs), such as CLIP, for open-vocabulary TAL (OV-TAL). However, despite the success of VLMs trained on extensive datasets, existing OV-TAL methods still rely on human-labeled TAL datasets of limited size to train action localizers, limiting their generalizability. In this paper, we explore the scalability of self-training with unlabeled YouTube videos for OV-TAL. Our approach consists of two stages: (1) a class-agnostic action localizer is trained on a human-labeled TAL dataset to generate pseudo-labels for unlabeled videos, and (2) the large-scale pseudo-labeled dataset is then used to train the localizer. Extensive experiments demonstrate that leveraging web-scale videos in self-training significantly…
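The first stage described above (a class-agnostic localizer producing pseudo-labels for unlabeled videos) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Segment` type, the `generate_pseudo_labels` helper, and the `min_score` confidence filter are all hypothetical names and assumptions introduced here, and the abstract does not say how pseudo-labels are filtered.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Segment:
    """A class-agnostic action proposal (hypothetical structure)."""
    start: float  # seconds
    end: float    # seconds
    score: float  # localizer confidence in [0, 1]


# Stand-in for a stage-1 localizer trained on a human-labeled TAL dataset:
# any function that maps a video id to class-agnostic segments.
Localizer = Callable[[str], List[Segment]]


def generate_pseudo_labels(
    localizer: Localizer,
    unlabeled_videos: List[str],
    min_score: float = 0.5,  # assumed threshold, not taken from the paper
) -> Dict[str, List[Segment]]:
    """Run the localizer on unlabeled videos and keep confident segments."""
    pseudo: Dict[str, List[Segment]] = {}
    for video_id in unlabeled_videos:
        kept = [
            seg for seg in localizer(video_id)
            if seg.score >= min_score and seg.end > seg.start
        ]
        if kept:  # skip videos with no confident, non-empty segment
            pseudo[video_id] = kept
    return pseudo
```

In the second stage, the resulting dictionary would play the role of the large-scale pseudo-labeled dataset used to train the localizer; the training loop itself is omitted because the abstract gives no details about it.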

Code & Models

Repositories

hyunjs/stov-tal (PyTorch, official)

Taxonomy

Topics: Speech and dialogue systems · Natural Language Processing Techniques · Intelligent Tutoring Systems and Adaptive Learning

Methods: Contrastive Language-Image Pre-training