Language-free Training for Zero-shot Video Grounding
Dahye Kim, Jungin Park, Jiyoung Lee, Seongheon Park, Kwanghoon Sohn

TL;DR
This paper introduces a novel language-free training framework for zero-shot video grounding that leverages visual features and CLIP's aligned visual-language space, eliminating the need for annotated language data.
Contribution
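It proposes a training method that learns video grounding without language annotations: a temporal interval is selected as a hypothetical target, and the visual feature of that interval, taken from CLIP's aligned visual-language space, serves as a pseudo-language query. The resulting model outperforms existing zero-shot video grounding methods.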
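The idea of using an interval's visual feature in place of a text query can be illustrated with a minimal sketch. This is not the authors' implementation: the frame features below are random stand-ins for CLIP image-encoder outputs, the uniform interval sampling and mean pooling are assumptions made for illustration, and all function names (`sample_interval`, `pseudo_query`, `temporal_iou`) are hypothetical.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def sample_interval(num_frames, rng, min_len=2):
    """Pick a random interval [start, end) to act as the hypothetical target.

    Uniform sampling is an assumption; the source does not say how
    intervals are selected.
    """
    length = rng.randint(min_len, num_frames)
    start = rng.randint(0, num_frames - length)
    return start, start + length


def pseudo_query(frame_features, start, end):
    """Mean-pool the frame features inside [start, end) and normalize.

    The pooled visual feature stands in for a text embedding, relying on
    the visual and language spaces being aligned.
    """
    segment = frame_features[start:end]
    dim = len(segment[0])
    mean = [sum(f[d] for f in segment) / len(segment) for d in range(dim)]
    return l2_normalize(mean)


def temporal_iou(pred, target):
    """Temporal IoU between two (start, end) intervals."""
    inter = max(0.0, min(pred[1], target[1]) - max(pred[0], target[0]))
    union = (pred[1] - pred[0]) + (target[1] - target[0]) - inter
    return inter / union if union > 0 else 0.0


# Stand-in for per-frame CLIP image features: 16 frames, 8 dimensions.
rng = random.Random(0)
frames = [l2_normalize([rng.gauss(0, 1) for _ in range(8)]) for _ in range(16)]

# The sampled interval is the training target; its pooled feature is the query.
start, end = sample_interval(len(frames), rng)
query = pseudo_query(frames, start, end)
```

In a training loop under these assumptions, a grounding network would receive `frames` and `query`, predict an interval, and be scored against `(start, end)` with a measure such as `temporal_iou`.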
Findings
Outperforms existing zero-shot video grounding methods.
Surpasses several weakly-supervised approaches.
Demonstrates effectiveness on standard datasets.
Abstract
Given an untrimmed video and a language query depicting a specific temporal moment in the video, video grounding aims to localize the time interval by understanding the text and video simultaneously. One of the most challenging issues is an extremely time- and cost-consuming annotation collection, including video captions in a natural language form and their corresponding temporal regions. In this paper, we present a simple yet novel training framework for video grounding in the zero-shot setting, which learns a network with only video data without any annotation. Inspired by the recent language-free paradigm, i.e. training without language data, we train the network without compelling the generation of fake (pseudo) text queries into a natural language form. Specifically, we propose a method for learning a video grounding model by selecting a temporal interval as a hypothetical correct…
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Natural Language Processing Techniques
Methods: Contrastive Language-Image Pre-training
