GEM: A General Evaluation Benchmark for Multimodal Tasks

Lin Su; Nan Duan; Edward Cui; Lei Ji; Chenfei Wu; Huaishao Luo; Yongfei Liu; Ming Zhong; Taroon Bharti; Arun Sacheti

arXiv:2106.09889·cs.CL·June 21, 2021

TL;DR

GEM is a large-scale, multilingual benchmark for evaluating vision-language models on both image-language tasks (GEM-I) and video-language tasks (GEM-V), in contrast to benchmarks such as GLUE and XTREME that mainly cover natural language tasks.

Contribution

It introduces GEM, a large-scale, multilingual vision-language benchmark covering both image and video tasks, together with two baseline models to facilitate research.

Findings

01 Largest vision-language dataset that covers image-language and video-language tasks at the same time

02 Labeled in multiple languages

03 Two baseline models provided for future research

Abstract

In this paper, we present GEM as a General Evaluation benchmark for Multimodal tasks. Unlike existing datasets such as GLUE, SuperGLUE, XGLUE and XTREME, which mainly focus on natural language tasks, GEM is a large-scale vision-language benchmark consisting of GEM-I for image-language tasks and GEM-V for video-language tasks. Compared with existing multimodal datasets such as MSCOCO and Flickr30K for image-language tasks, and YouCook2 and MSR-VTT for video-language tasks, GEM is not only the largest vision-language dataset covering image-language tasks and video-language tasks at the same time, but is also labeled in multiple languages. We also provide two baseline models for this benchmark. We will release the dataset, code and baseline models, aiming to advance the development of multilingual multimodal research.

Code & Models

Repositories

microsoft/GEM (Official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques