Zero-Shot Dense Video Captioning by Jointly Optimizing Text and Moment

Yongrae Jo; Seongyun Lee; Aiden SJ Lee; Hyunji Lee; Hanseok Oh; Minjoon Seo

arXiv:2307.02682·cs.CV·July 13, 2023·5 cites

Open Access

TL;DR

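ZeroTA performs zero-shot dense video captioning by jointly optimizing, at test time, a soft moment mask and the prefix parameters of a frozen language model (GPT-2), guided by a frozen vision-language contrastive model (CLIP), so that events are localized and described without any training videos or annotations.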

Contribution

The paper introduces ZeroTA, a zero-shot approach to dense video captioning that aligns a frozen language model with a frozen vision-language model through test-time joint optimization, eliminating the need for annotated training data.

Findings

01 Outperforms zero-shot baselines on ActivityNet Captions

02 Surpasses state-of-the-art few-shot methods

03 Demonstrates robustness in out-of-domain scenarios

Abstract

Dense video captioning, a task of localizing meaningful moments and generating relevant captions for videos, often requires a large, expensive corpus of annotated video segments paired with text. In an effort to minimize the annotation cost, we propose ZeroTA, a novel method for dense video captioning in a zero-shot manner. Our method does not require any videos or annotations for training; instead, it localizes and describes events within each input video at test time by optimizing solely on the input. This is accomplished by introducing a soft moment mask that represents a temporal segment in the video and jointly optimizing it with the prefix parameters of a language model. This joint optimization aligns a frozen language generation model (i.e., GPT-2) with a frozen vision-language contrastive model (i.e., CLIP) by maximizing the matching score between the generated text and a moment…
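To make the soft moment mask concrete, here is a minimal sketch of one way such a mask and its matching score could be computed. Everything in it is an assumption made for illustration: the abstract does not give the parameterization, so the mask is modeled as a product of two sigmoids over normalized frame times, the names `soft_moment_mask`, `matching_score`, and `sharpness` are hypothetical, and toy vectors stand in for CLIP frame and text embeddings. The gradient updates and the GPT-2 prefix parameters are not shown.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def soft_moment_mask(num_frames, center, width, sharpness=20.0):
    """Smooth 0..1 weights over frames for the segment [center - width/2, center + width/2].

    Assumed parameterization (not taken from the paper): frame times are
    normalized to [0, 1] and the mask is a product of two sigmoids, so it
    would be differentiable with respect to center and width.
    """
    t = np.linspace(0.0, 1.0, num_frames)
    start = center - width / 2.0
    end = center + width / 2.0
    return sigmoid(sharpness * (t - start)) * sigmoid(sharpness * (end - t))

def matching_score(frame_feats, text_feat, mask):
    """Cosine similarity between the mask-weighted mean frame feature and a text feature."""
    weights = mask / (mask.sum() + 1e-8)
    moment_feat = weights @ frame_feats
    moment_feat = moment_feat / (np.linalg.norm(moment_feat) + 1e-8)
    text_unit = text_feat / (np.linalg.norm(text_feat) + 1e-8)
    return float(moment_feat @ text_unit)

# Toy stand-ins for embeddings: the first five frames show "event A",
# the last five show "event B", and the text describes event B.
frame_feats = np.array([[1.0, 0.0]] * 5 + [[0.0, 1.0]] * 5)
text_feat = np.array([0.0, 1.0])

early = soft_moment_mask(10, center=0.25, width=0.4)
late = soft_moment_mask(10, center=0.75, width=0.4)
print("early moment score:", matching_score(frame_feats, text_feat, early))
print("late moment score: ", matching_score(frame_feats, text_feat, late))
```

In this toy setup the mask placed over the second half of the video scores higher against the text, which is the signal a joint optimization over mask and text would follow when moving the moment toward the frames that match the caption.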

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Natural Language Processing Techniques