M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models
Wenxuan Zhang, Sharifah Mahani Aljunied, Chang Gao, Yew Ken Chia, Lidong Bing

TL;DR
M3Exam is a multilingual, multimodal benchmark built from real human exams and designed to evaluate large language models across languages, modalities, and educational levels.
Contribution
The paper introduces M3Exam, a novel benchmark from real exams that assesses LLMs in multilingual, multimodal, and multilevel contexts, filling gaps in existing evaluation methods.
Findings
Current LLMs, including GPT-4, struggle with multilingual text, especially in low-resource languages.
Multimodal LLMs perform poorly on complex multimodal questions.
M3Exam provides a comprehensive platform for evaluating LLM development.
Abstract
Despite the existence of various benchmarks for evaluating natural language processing models, we argue that human exams are a more suitable means of evaluating general intelligence for large language models (LLMs), as they inherently demand a much wider range of abilities such as language understanding, domain knowledge, and problem-solving skills. To this end, we introduce M3Exam, a novel benchmark sourced from real and official human exam questions for evaluating LLMs in a multilingual, multimodal, and multilevel context. M3Exam exhibits three unique characteristics: (1) multilingualism, encompassing questions from multiple countries that require strong multilingual proficiency and cultural knowledge; (2) multimodality, accounting for the multimodal nature of many exam questions to test the model's multimodal understanding capability; and (3) multilevel structure, featuring exams…
Code & Models
- 🤗 SeaLLMs/SeaLLM-13B-Chat (model) · ♡ 64
- 🤗 SeaLLMs/SeaLLM-7B-v2 (model) · 8.4k dl · ♡ 68
- 🤗 LoneStriker/SeaLLM-7B-v2-GGUF (model) · 169 dl · ♡ 6
- 🤗 LoneStriker/SeaLLM-7B-v2-3.0bpw-h6-exl2 (model) · 1 dl
- 🤗 LoneStriker/SeaLLM-7B-v2-4.0bpw-h6-exl2 (model) · 4 dl
- 🤗 LoneStriker/SeaLLM-7B-v2-5.0bpw-h6-exl2 (model) · 5 dl
- 🤗 LoneStriker/SeaLLM-7B-v2-6.0bpw-h6-exl2 (model) · 1 dl
- 🤗 LoneStriker/SeaLLM-7B-v2-8.0bpw-h8-exl2 (model) · 4 dl
- 🤗 LoneStriker/SeaLLM-7B-v2-AWQ (model) · 5 dl
- 🤗 SeaLLMs/SeaLLM-7B-v2-gguf (model) · 53 dl · ♡ 9
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Adam · Residual Connection · Dense Connections · Dropout · Byte Pair Encoding · Softmax · Layer Normalization
