A Survey of Current Datasets for Vision and Language Research
Francis Ferraro, Nasrin Mostafazadeh, Ting-Hao (Kenneth) Huang, Lucy Vanderwende, Jacob Devlin, Michel Galley, Margaret Mitchell

TL;DR
This survey reviews recent vision and language datasets, proposes quality metrics for evaluating them, and categorizes them accordingly. It finds that newer datasets use more complex language and more abstract concepts, and that each dataset has its own strengths and weaknesses.
Contribution
The paper introduces a set of quality metrics for evaluating vision and language datasets, categorizes the datasets according to those metrics, and analyzes their characteristics and how they have changed over time.
Findings
Recent datasets use more complex language and abstract concepts
Different datasets exhibit unique strengths and weaknesses
The proposed metrics help evaluate dataset quality effectively
Abstract
Integrating vision and language has long been a dream in work on artificial intelligence (AI). In the past two years, we have witnessed an explosion of work that brings together vision and language, from images to videos and beyond. The available corpora have played a crucial role in advancing this area of research. In this paper, we propose a set of quality metrics for evaluating and analyzing vision & language datasets and categorize the datasets accordingly. Our analyses show that the most recent datasets use more complex language and more abstract concepts; however, each has different strengths and weaknesses.
