AutoTVG: A New Vision-language Pre-training Paradigm for Temporal Video Grounding
Xing Zhang, Jiaxi Gu, Haoyu Zhao, Shicong Wang, Hang Xu, Renjing Pei, Songcen Xu, Zuxuan Wu, Yu-Gang Jiang

TL;DR
AutoTVG introduces a novel pre-training paradigm for temporal video grounding that leverages automatically annotated untrimmed videos to improve zero-shot and supervised performance.
Contribution
It proposes AutoTVG, a new paradigm with a captioned moment generation module and a regression-based grounding network, addressing limitations of traditional pre-training methods.
Findings
Achieves competitive zero-shot performance on Charades-STA and ActivityNet Captions.
Outperforms existing pre-training frameworks with less training data.
Effectively learns semantic alignment and boundary regression from unannotated videos.
Abstract
Temporal Video Grounding (TVG) aims to localize a moment in an untrimmed video given a language description. Since annotating TVG data is labor-intensive, TVG under limited supervision has received attention in recent years. The great success of vision-language pre-training has led TVG to follow the traditional "pre-training + fine-tuning" paradigm. However, the pre-training process suffers from a lack of temporal modeling and fine-grained alignment, because the nature of the data differs between pre-training and testing. Besides, the large gap between pretext and downstream tasks makes zero-shot testing impossible for the pre-trained model. To avoid the drawbacks of the traditional paradigm, we propose AutoTVG, a new vision-language pre-training paradigm for TVG that enables the model to learn semantic alignment and boundary regression from automatically annotated untrimmed videos. To…
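To make the localization task concrete, the sketch below computes temporal intersection-over-union (IoU) between a predicted moment and a reference moment, each given as a (start, end) pair in seconds. This is an illustration only: the abstract above does not say which metric AutoTVG reports, and the function name `temporal_iou` and the example timestamps are assumptions. Temporal IoU is shown here because it is a commonly used way to score how well regressed boundaries match an annotated moment.

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) segments, in seconds.

    Hypothetical helper for illustration; not taken from the AutoTVG paper.
    """
    pred_start, pred_end = pred
    gt_start, gt_end = gt
    # Overlap is zero when the two segments do not intersect.
    intersection = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - intersection
    return intersection / union if union > 0 else 0.0


# Example: a predicted moment of 2 s to 8 s against a reference of 4 s to 10 s.
score = temporal_iou((2.0, 8.0), (4.0, 10.0))
print(score)
```

A prediction is often counted as correct when this score exceeds a chosen threshold; the threshold itself is a choice of the evaluation protocol and is not specified in the text above.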
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization
