GEM: A General Evaluation Benchmark for Multimodal Tasks
Lin Su, Nan Duan, Edward Cui, Lei Ji, Chenfei Wu, Huaishao Luo, Yongfei Liu, Ming Zhong, Taroon Bharti, Arun Sacheti

TL;DR
GEM is a large-scale, multilingual benchmark for evaluating models on image-language and video-language tasks, filling a gap left by existing benchmarks that focus mainly on natural language.
Contribution
It introduces GEM, a large-scale, multilingual vision-language benchmark covering both image and video tasks, with baseline models to facilitate research.
Findings
Largest vision-language dataset to cover image-language and video-language tasks at the same time
Annotations in multiple languages
Baseline models provided for future research
Abstract
In this paper, we present GEM as a General Evaluation benchmark for Multimodal tasks. Unlike existing datasets such as GLUE, SuperGLUE, XGLUE and XTREME, which mainly focus on natural language tasks, GEM is a large-scale vision-language benchmark consisting of GEM-I for image-language tasks and GEM-V for video-language tasks. Compared with existing multimodal datasets such as MSCOCO and Flickr30K for image-language tasks and YouCook2 and MSR-VTT for video-language tasks, GEM is not only the largest vision-language dataset covering image-language and video-language tasks at the same time, but is also labeled in multiple languages. We also provide two baseline models for this benchmark. We will release the dataset, code and baseline models, aiming to advance the development of multilingual multimodal research.
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques
