VATEX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research
Xin Wang, Jiawei Wu, Junkun Chen, Lei Li, Yuan-Fang Wang, William Yang Wang

TL;DR
VATEX is a large-scale English-Chinese video description dataset that supports research on multilingual video captioning and video-guided machine translation. The authors report that a unified multilingual captioning model outperforms monolingual ones and that video context improves translation.
Contribution
The paper introduces VATEX, a large-scale multilingual video dataset with parallel English-Chinese captions, and proposes two tasks built on it: multilingual video captioning and video-guided machine translation.
Findings
Multilingual models outperform monolingual counterparts in captioning.
Video context improves machine translation accuracy.
Compared with MSR-VTT, VATEX is larger, multilingual, more linguistically complex, and more diverse in both videos and descriptions.
Abstract
We present a new large-scale multilingual video description dataset, VATEX, which contains over 41,250 videos and 825,000 captions in both English and Chinese. Among the captions, there are over 206,000 English-Chinese parallel translation pairs. Compared to the widely-used MSR-VTT dataset, VATEX is multilingual, larger, linguistically complex, and more diverse in terms of both video and natural language descriptions. We also introduce two tasks for video-and-language research based on VATEX: (1) Multilingual Video Captioning, aimed at describing a video in various languages with a compact unified captioning model, and (2) Video-guided Machine Translation, to translate a source language description into the target language using the video information as additional spatiotemporal context. Extensive experiments on the VATEX dataset show that, first, the unified multilingual model can not…
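The figures quoted in the abstract imply some simple per-video averages. The sketch below works them out; it treats the "over ..." counts as exact values, which is an assumption, so the results are approximate. The constant and function names are illustrative and do not come from the paper or any VATEX tooling.

```python
# Counts quoted in the abstract. Each is stated as "over N", so treating
# them as exact is an assumption made only for this back-of-envelope check.
NUM_VIDEOS = 41_250
NUM_CAPTIONS = 825_000          # English and Chinese combined
NUM_PARALLEL_PAIRS = 206_000    # English-Chinese translation pairs


def per_video(total: int, videos: int = NUM_VIDEOS) -> float:
    """Return the average number of items per video."""
    return total / videos


captions_per_video = per_video(NUM_CAPTIONS)
pairs_per_video = per_video(NUM_PARALLEL_PAIRS)

print(f"captions per video:       {captions_per_video:.2f}")
print(f"parallel pairs per video: {pairs_per_video:.2f}")
```

Under these assumptions each video carries about 20 captions across the two languages, of which roughly 5 per video are parallel translation pairs.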
Code & Models
- 🤗 google/paligemma2-3b-ft-docci-448-jax · model · ♡ 2
- 🤗 google/paligemma2-10b-ft-docci-448-jax · model · ♡ 2
- 🤗 google/paligemma2-3b-mix-224 · model · 43k dl · ♡ 48
- 🤗 google/paligemma2-3b-mix-448-jax · model · 2 dl · ♡ 2
- 🤗 google/paligemma2-3b-ft-docci-448 · model · 36k dl · ♡ 13
- 🤗 google/paligemma2-10b-ft-docci-448 · model · 927 dl · ♡ 17
- 🤗 google/paligemma2-10b-mix-224-jax · model
- 🤗 google/paligemma2-3b-mix-448 · model · 3.8k dl · ♡ 57
- 🤗 google/paligemma2-10b-mix-224 · model · 194 dl · ♡ 10
- 🤗 google/paligemma2-10b-mix-448 · model · 551 dl · ♡ 35
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Human Pose and Action Recognition
