Auto-captions on GIF: A Large-scale Video-sentence Dataset for Vision-language Pre-training

Yingwei Pan; Yehao Li; Jianjie Luo; Jun Xu; Ting Yao; Tao Mei

arXiv:2007.02375 · cs.CV · July 7, 2020 · 27 cites

Open Access

TL;DR

This paper introduces Auto-captions on GIF, a large-scale dataset created from web data for pre-training models in video understanding and captioning, demonstrating its effectiveness across multiple downstream tasks.

Contribution

The paper presents a new large-scale, automatically generated video-sentence dataset and evaluates a Transformer-based pre-training approach for vision-language tasks.

Findings

01 Effective pre-training for video captioning

02 Strong generalization on MSR-VTT

03 Comparison with existing datasets shows advantages

Abstract

In this work, we present Auto-captions on GIF, a new large-scale pre-training dataset for generic video understanding. All video-sentence pairs are created by automatically extracting and filtering video caption annotations from billions of web pages. The Auto-captions on GIF dataset can be used to pre-train a generic feature representation or an encoder-decoder structure for video captioning, as well as for other downstream tasks (e.g., sentence localization in videos and video question answering). We present a detailed analysis of the Auto-captions on GIF dataset in comparison to existing video-sentence datasets. We also evaluate a Transformer-based encoder-decoder structure for vision-language pre-training, which is further adapted to the downstream video captioning task and shows compelling generalizability on MSR-VTT. The dataset is available at…

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling