Language-free Training for Zero-shot Video Grounding

Dahye Kim; Jungin Park; Jiyoung Lee; Seongheon Park; Kwanghoon Sohn

arXiv:2210.12977 · cs.CV · October 25, 2022

TL;DR

This paper introduces a novel language-free training framework for zero-shot video grounding that leverages visual features and CLIP's aligned visual-language space, eliminating the need for annotated language data.

Contribution

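It proposes a training method that learns video grounding without language annotations: it selects a temporal interval in the video and uses the visual features of that interval as a pseudo-language query. The method outperforms existing zero-shot video grounding methods and several weakly-supervised approaches.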
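The idea of building a training pair from video alone can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the uniform random interval sampling, the mean pooling, and every function name here are assumptions made for the example, and the random vectors stand in for per-frame visual features that would come from CLIP's image encoder.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def sample_interval(num_frames, rng, min_len=2):
    """Pick a random interval [start, end) to act as the pseudo ground truth.

    Assumption: uniform random sampling; the paper's actual selection
    procedure is not described in this summary.
    """
    length = rng.randint(min_len, num_frames)
    start = rng.randint(0, num_frames - length)
    return start, start + length


def pseudo_query_feature(frame_feats, start, end):
    """Mean-pool the frame features inside the interval and normalize.

    Assumption: the pooled visual feature replaces a text embedding,
    relying on a shared visual-language space such as CLIP's.
    """
    span = frame_feats[start:end]
    dim = len(span[0])
    mean = [sum(f[d] for f in span) / len(span) for d in range(dim)]
    return l2_normalize(mean)


def make_training_sample(frame_feats, rng):
    """Return (pseudo-language feature, normalized interval target)."""
    num_frames = len(frame_feats)
    start, end = sample_interval(num_frames, rng)
    query = pseudo_query_feature(frame_feats, start, end)
    target = (start / num_frames, end / num_frames)
    return query, target


# Stand-in data: 16 frames with 8-dimensional features (hypothetical sizes).
rng = random.Random(0)
frame_feats = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(16)]
query, target = make_training_sample(frame_feats, rng)
```

A grounding network trained on such pairs would take the video features and `query` as input and regress `target`. At test time the pseudo-language feature would be replaced by the text embedding of a real query, which is where the aligned visual-language space matters.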

Findings

01 Outperforms existing zero-shot video grounding methods.

02 Surpasses several weakly-supervised approaches.

03 Demonstrates effectiveness on standard datasets.

Abstract

Given an untrimmed video and a language query depicting a specific temporal moment in the video, video grounding aims to localize the time interval by understanding the text and video simultaneously. One of the most challenging issues is the extremely time-consuming and costly collection of annotations, including natural-language video captions and their corresponding temporal regions. In this paper, we present a simple yet novel training framework for video grounding in the zero-shot setting, which trains a network with only video data and without any annotation. Inspired by the recent language-free paradigm, i.e., training without language data, we train the network without forcing the generated fake (pseudo) text queries into a natural language form. Specifically, we propose a method for learning a video grounding model by selecting a temporal interval as a hypothetical correct…

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Natural Language Processing Techniques

Methods: Contrastive Language-Image Pre-training