M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models

Wenxuan Zhang; Sharifah Mahani Aljunied; Chang Gao; Yew Ken Chia; Lidong Bing

arXiv:2306.05179 · cs.CL · November 13, 2023 · 31 cites

TL;DR

M3Exam is a benchmark built from real human exams that evaluates large language models across multiple languages, modalities, and educational levels.

Contribution

The paper introduces M3Exam, a novel benchmark from real exams that assesses LLMs in multilingual, multimodal, and multilevel contexts, filling gaps in existing evaluation methods.

Findings

01

Current LLMs, including GPT-4, still struggle with multilingual text, especially in low-resource languages.

02

Multimodal LLMs perform poorly on complex multimodal questions.

03

M3Exam provides a comprehensive platform for evaluating LLM development.

Abstract

Despite the existence of various benchmarks for evaluating natural language processing models, we argue that human exams are a more suitable means of evaluating general intelligence for large language models (LLMs), as they inherently demand a much wider range of abilities such as language understanding, domain knowledge, and problem-solving skills. To this end, we introduce M3Exam, a novel benchmark sourced from real and official human exam questions for evaluating LLMs in a multilingual, multimodal, and multilevel context. M3Exam exhibits three unique characteristics: (1) multilingualism, encompassing questions from multiple countries that require strong multilingual proficiency and cultural knowledge; (2) multimodality, accounting for the multimodal nature of many exam questions to test the model's multimodal understanding capability; and (3) multilevel structure, featuring exams…
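As a rough illustration of how results on a benchmark organised along these axes might be summarised, the sketch below computes multiple-choice accuracy grouped by language and by educational level. The record fields (`language`, `level`, `answer`, `prediction`), the level labels, and the sample values are hypothetical placeholders, not the schema or contents of the released M3Exam data.

```python
from collections import defaultdict

# Hypothetical records: field names and values are illustrative only,
# not the schema of the released M3Exam data.
records = [
    {"language": "english", "level": "low", "answer": "A", "prediction": "A"},
    {"language": "english", "level": "high", "answer": "C", "prediction": "B"},
    {"language": "thai", "level": "mid", "answer": "D", "prediction": "D"},
    {"language": "thai", "level": "mid", "answer": "B", "prediction": "A"},
]


def accuracy_by(records, key):
    """Return {group: fraction of records whose prediction equals the answer}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        group = record[key]
        total[group] += 1
        correct[group] += int(record["prediction"] == record["answer"])
    return {group: correct[group] / total[group] for group in total}


# Break the same set of predictions down along two of the benchmark's axes.
by_language = accuracy_by(records, "language")
by_level = accuracy_by(records, "level")
```

Reporting accuracy per group rather than as a single pooled number keeps weak performance in one language or level from being hidden by stronger results elsewhere.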

Code & Models

Repositories

damo-nlp-sg/m3exam (Official)

Videos

M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models · SlidesLive

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Adam · Residual Connection · Dense Connections · Dropout · Byte Pair Encoding · Softmax · Layer Normalization