AutoTVG: A New Vision-language Pre-training Paradigm for Temporal Video Grounding

Xing Zhang; Jiaxi Gu; Haoyu Zhao; Shicong Wang; Hang Xu; Renjing Pei; Songcen Xu; Zuxuan Wu; Yu-Gang Jiang

arXiv:2406.07091·cs.CV·June 12, 2024

TL;DR

AutoTVG introduces a novel pre-training paradigm for temporal video grounding that leverages automatically annotated untrimmed videos to improve zero-shot and supervised performance.

Contribution

It proposes AutoTVG, a new paradigm with a captioned moment generation module and a regression-based grounding network, addressing limitations of traditional pre-training methods.

Findings

01. Achieves competitive zero-shot performance on Charades-STA and ActivityNet Captions.

02. Outperforms existing pre-training frameworks with less training data.

03. Effectively learns semantic alignment and boundary regression from unannotated videos.

Abstract

Temporal Video Grounding (TVG) aims to localize a moment in an untrimmed video given a language description. Because TVG annotation is labor-intensive, TVG under limited supervision has received attention in recent years. The great success of vision-language pre-training has led TVG to follow the traditional "pre-training + fine-tuning" paradigm. However, the pre-training process suffers from a lack of temporal modeling and fine-grained alignment because the pre-training data differ in nature from the test data. Moreover, the large gap between pretext and downstream tasks makes zero-shot testing impossible for the pre-trained model. To avoid the drawbacks of the traditional paradigm, we propose AutoTVG, a new vision-language pre-training paradigm for TVG that enables the model to learn semantic alignment and boundary regression from automatically annotated untrimmed videos. To…
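
To make the task definition concrete, here is a minimal sketch of how a localized moment can be represented and compared against a reference annotation. It assumes a moment is a (start, end) pair in seconds and uses temporal intersection-over-union, a measure commonly reported on TVG benchmarks. The abstract itself does not name a metric, so the `Moment` type, the `temporal_iou` function, and the example values are illustrative assumptions rather than the authors' implementation.

```python
from typing import Tuple

# Assumption: a moment is a (start_sec, end_sec) pair with start <= end.
Moment = Tuple[float, float]


def temporal_iou(pred: Moment, gt: Moment) -> float:
    """Temporal IoU between a predicted and a reference moment."""
    # Overlap is the shared interval, clamped at zero when the moments are disjoint.
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    # Union is the sum of both durations minus the overlap.
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


# Hypothetical example: a prediction of 2-8 s against a reference of 4-10 s.
print(temporal_iou((2.0, 8.0), (4.0, 10.0)))
```

A boundary-regression model in this setting would output the two numbers in `pred` directly, and a score like this one indicates how closely they match the annotated boundaries.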

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization