VATEX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research

Xin Wang; Jiawei Wu; Junkun Chen; Lei Li; Yuan-Fang Wang; William Yang Wang

arXiv:1904.03493 · cs.CV · June 18, 2020 · 39 cites

TL;DR

VATEX is a large-scale English-Chinese video description dataset that supports research in multilingual video captioning and video-guided machine translation. Experiments on it show that a unified multilingual captioning model outperforms monolingual ones and that video context improves translation.

Contribution

The paper introduces VATEX, a large-scale English-Chinese video description dataset with parallel captions, and proposes two tasks built on it: multilingual video captioning and video-guided machine translation.

Findings

1. Multilingual models outperform monolingual counterparts in captioning.

2. Video context improves machine translation accuracy.

3. VATEX enables diverse and complex video-language research.

Abstract

We present a new large-scale multilingual video description dataset, VATEX, which contains over 41,250 videos and 825,000 captions in both English and Chinese. Among the captions, there are over 206,000 English-Chinese parallel translation pairs. Compared to the widely-used MSR-VTT dataset, VATEX is multilingual, larger, linguistically complex, and more diverse in terms of both video and natural language descriptions. We also introduce two tasks for video-and-language research based on VATEX: (1) Multilingual Video Captioning, aimed at describing a video in various languages with a compact unified captioning model, and (2) Video-guided Machine Translation, to translate a source language description into the target language using the video information as additional spatiotemporal context. Extensive experiments on the VATEX dataset show that, first, the unified multilingual model can not…

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Human Pose and Action Recognition