WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning

Krishna Srinivasan; Karthik Raman; Jiecao Chen; Michael Bendersky; Marc Najork

arXiv:2103.01913·cs.CV·February 21, 2023

TL;DR

The WIT dataset is a large, multilingual, and diverse visio-linguistic dataset derived from Wikipedia, designed to advance multimodal and multilingual machine learning with its extensive image-text pairs and real-world complexity.

Contribution

This paper introduces WIT, a Wikipedia-derived visio-linguistic dataset that is the largest and most multilingual of its kind, enabling improved pretraining and evaluation of multimodal multilingual models.

Findings

01

WIT was the largest multimodal dataset by number of image-text examples at the time of writing, with 37.6 million examples.

02

WIT covers over 100 languages, each with at least 12K examples.

03

WIT provides a challenging real-world test set for image-text retrieval.
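
The per-language coverage figure above can be checked against a local copy of the data along the following lines. This is a minimal sketch, not the authors' tooling: it assumes the records are tab-separated with a header row, and the column names (`language`, `image_url`, `caption`) and the sample rows are hypothetical stand-ins chosen for illustration.

```python
import csv
import io
from collections import Counter

# Hypothetical miniature sample. Real WIT files are far larger, and the
# column names used here are assumptions made for this illustration.
SAMPLE_TSV = (
    "language\timage_url\tcaption\n"
    "en\thttp://example.org/a.jpg\tA red bridge\n"
    "en\thttp://example.org/b.jpg\tA harbour at dusk\n"
    "de\thttp://example.org/a.jpg\tEine rote Brücke\n"
    "ja\thttp://example.org/c.jpg\t港の風景\n"
)


def language_counts(tsv_file):
    """Count image-text examples per language code."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    return Counter(row["language"] for row in reader)


def languages_with_at_least(counts, threshold):
    """Return the sorted language codes having at least `threshold` examples."""
    return sorted(lang for lang, n in counts.items() if n >= threshold)


counts = language_counts(io.StringIO(SAMPLE_TSV))
well_covered = languages_with_at_least(counts, threshold=2)
print(counts)
print(well_covered)
```

On the full dataset the threshold would be 12_000, matching the 12K figure quoted in the findings; the toy threshold of 2 is used only so the sample produces a non-trivial result.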

Abstract

The milestone improvements brought about by deep representation learning and pre-training techniques have led to large performance gains across downstream NLP, IR and Vision tasks. Multimodal modeling techniques aim to leverage large high-quality visio-linguistic datasets for learning complementary information (across image and text modalities). In this paper, we introduce the Wikipedia-based Image Text (WIT) Dataset (https://github.com/google-research-datasets/wit) to better facilitate multimodal, multilingual learning. WIT is composed of a curated set of 37.6 million entity-rich image-text examples with 11.5 million unique images across 108 Wikipedia languages. Its size enables WIT to be used as a pretraining dataset for multimodal models, as we show when applied to downstream tasks such as image-text retrieval. WIT has four main and unique advantages. First, WIT is the largest…
