MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark
Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, Wenhu Chen

TL;DR
MMLU-Pro is an enhanced benchmark that introduces more challenging, reasoning-focused questions with expanded options, providing a more discriminative and stable evaluation for language models' understanding and reasoning capabilities.
Contribution
The paper presents MMLU-Pro, a revised dataset with increased difficulty, reduced noise, and expanded answer choices, improving the assessment of language models' reasoning skills.
Findings
Model accuracy drops by 16% to 33% on MMLU-Pro compared to MMLU, indicating higher difficulty.
Sensitivity of model scores to prompt variations decreases from 4-5% on MMLU to 2% on MMLU-Pro.
Chain of Thought reasoning improves performance on MMLU-Pro, unlike on the original MMLU.
Abstract
In the age of large-scale language models, benchmarks like the Massive Multitask Language Understanding (MMLU) have been pivotal in pushing the boundaries of what AI can achieve in language comprehension and reasoning across diverse domains. However, as models continue to improve, their performance on these benchmarks has begun to plateau, making it increasingly difficult to discern differences in model capabilities. This paper introduces MMLU-Pro, an enhanced dataset designed to extend the mostly knowledge-driven MMLU benchmark by integrating more challenging, reasoning-focused questions and expanding the choice set from four to ten options. Additionally, MMLU-Pro eliminates the trivial and noisy questions in MMLU. Our experimental results show that MMLU-Pro not only raises the challenge, causing a significant drop in accuracy by 16% to 33% compared to MMLU but also demonstrates…
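One consequence of expanding the choice set from four to ten options is a lower floor for random guessing. The following is a minimal sketch of that arithmetic, not code from the paper: the function name `chance_accuracy` is hypothetical, and it assumes every question has exactly one correct answer and the full number of options stated in the abstract.

```python
from fractions import Fraction


def chance_accuracy(num_options: int) -> Fraction:
    """Expected accuracy of uniform random guessing (hypothetical helper).

    Assumes exactly one of the `num_options` choices is correct.
    """
    if num_options < 1:
        raise ValueError("num_options must be positive")
    return Fraction(1, num_options)


# Option counts taken from the abstract: four for MMLU, ten for MMLU-Pro.
mmlu_chance = chance_accuracy(4)
mmlu_pro_chance = chance_accuracy(10)

print(f"4 options:  {float(mmlu_chance):.0%}")
print(f"10 options: {float(mmlu_pro_chance):.0%}")
```

Under these assumptions the guessing baseline falls from 25% to 10%, so a correct answer is less likely to be a lucky guess. This accounts for only part of the reported accuracy drop; the rest reflects the harder, reasoning-focused questions.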
Code & Models
- 🤗 ghost-x/ghost-8b-beta · model · 621 dl · ♡ 14
- 🤗 FlorianJc/ghost-8b-beta-vllm-fp8 · model · 1 dl
- 🤗 QuantFactory/ghost-8b-beta-GGUF · model · 177 dl · ♡ 1
- 🤗 akjindal53244/Llama-3.1-Storm-8B · model · 2.2k dl · ♡ 177
- 🤗 akjindal53244/Llama-3.1-Storm-8B-FP8-Dynamic · model · 9 dl · ♡ 14
- 🤗 RichardErkhov/ghost-x_-_ghost-8b-beta-gguf · model · 30 dl
- 🤗 akjindal53244/Llama-3.1-Storm-8B-GGUF · model · 237 dl · ♡ 41
- 🤗 ghost-x/ghost-8b-beta-1608 · model · 8.8k dl · ♡ 34
- 🤗 ghost-x/ghost-8b-beta-1608-awq · model · 4 dl
- 🤗 ghost-x/ghost-8b-beta-1608-gguf · model · 94 dl · ♡ 6
Taxonomy
Topics: Natural Language Processing Techniques · Speech and dialogue systems · Topic Modeling
Methods: Sparse Evolutionary Training
