Zero-Shot Dense Video Captioning by Jointly Optimizing Text and Moment
Yongrae Jo, Seongyun Lee, Aiden SJ Lee, Hyunji Lee, Hanseok Oh, Minjoon Seo

TL;DR
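ZeroTA performs zero-shot dense video captioning by jointly optimizing a soft moment mask and the prefix parameters of a frozen language model (GPT-2) against a frozen vision-language contrastive model (CLIP). It localizes and describes events at test time, using only the input video and no training videos or annotations.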
Contribution
It introduces a novel zero-shot approach for dense video captioning that aligns language and vision models through joint optimization, eliminating the need for annotated training data.
Findings
Outperforms zero-shot baselines on ActivityNet Captions
Surpasses state-of-the-art few-shot methods
Demonstrates robustness in out-of-domain scenarios
Abstract
Dense video captioning, a task of localizing meaningful moments and generating relevant captions for videos, often requires a large, expensive corpus of annotated video segments paired with text. In an effort to minimize the annotation cost, we propose ZeroTA, a novel method for dense video captioning in a zero-shot manner. Our method does not require any videos or annotations for training; instead, it localizes and describes events within each input video at test time by optimizing solely on the input. This is accomplished by introducing a soft moment mask that represents a temporal segment in the video and jointly optimizing it with the prefix parameters of a language model. This joint optimization aligns a frozen language generation model (i.e., GPT-2) with a frozen vision-language contrastive model (i.e., CLIP) by maximizing the matching score between the generated text and a moment…
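The soft moment mask described in the abstract can be illustrated with a small, self-contained sketch. This is not the authors' implementation: the function names, the center/width parameterization, and the `sharpness` value are assumptions made for illustration. Fixed toy per-frame similarity scores stand in for CLIP, finite-difference gradient ascent stands in for backpropagation, and the GPT-2 prefix parameters that ZeroTA optimizes jointly with the mask are omitted entirely.

```python
import math


def _sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def soft_moment_mask(num_frames, center, width, sharpness=10.0):
    """Return a smooth per-frame weight in (0, 1) for a temporal segment.

    Frame positions are normalized to [0, 1]. The mask is the product of a
    rising and a falling sigmoid, so it varies smoothly with `center` and
    `width`. The parameterization is hypothetical, chosen for illustration.
    """
    mask = []
    for i in range(num_frames):
        t = i / (num_frames - 1)  # normalized time of frame i (needs >= 2 frames)
        rising = _sigmoid(sharpness * (t - (center - width / 2)))
        falling = _sigmoid(sharpness * ((center + width / 2) - t))
        mask.append(rising * falling)
    return mask


def matching_score(frame_sims, mask):
    """Mask-weighted mean of per-frame text similarity (toy stand-in for CLIP)."""
    return sum(m * s for m, s in zip(mask, frame_sims)) / sum(mask)


def optimize_moment(frame_sims, center=0.5, width=0.5, lr=0.05, steps=200, eps=1e-4):
    """Gradient ascent on (center, width) using central finite differences."""
    n = len(frame_sims)

    def score(c, w):
        return matching_score(frame_sims, soft_moment_mask(n, c, w))

    for _ in range(steps):
        grad_c = (score(center + eps, width) - score(center - eps, width)) / (2 * eps)
        grad_w = (score(center, width + eps) - score(center, width - eps)) / (2 * eps)
        # Clamp so the segment stays inside the video and keeps a minimum width.
        center = min(1.0, max(0.0, center + lr * grad_c))
        width = min(1.0, max(0.05, width + lr * grad_w))
    return center, width, score(center, width)


# Toy usage: made-up similarities that are high for frames 5 to 7 of 10.
toy_sims = [0.1] * 5 + [0.9] * 3 + [0.1] * 2
center, width, final_score = optimize_moment(toy_sims)
print(f"center={center:.2f} width={width:.2f} score={final_score:.2f}")
```

Because the mask is smooth in its two parameters, the matching score can be improved by gradient steps instead of a discrete search over segment boundaries. In the method as the abstract describes it, the text is not fixed as it is here: the caption is generated by the language model and optimized together with the mask.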
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Natural Language Processing Techniques
