Changing Answer Order Can Decrease MMLU Accuracy
Vipul Gupta, David Pantoja, Candace Ross, Adina Williams, Megan Ung

TL;DR
This paper shows that shuffling the answer options in multiple-choice questions lowers the accuracy of every large language model tested on the MMLU benchmark, which calls into question how robust accuracy-based leaderboard rankings are.
Contribution
The study demonstrates that answer order affects LLM accuracy on MMLU, and suggests that leaderboard evaluation should additionally account for the percentage of examples a model answers correctly by random chance.
Findings
Answer shuffling decreases model accuracy on MMLU
Models vary in sensitivity to answer order changes
Current evaluation methods may overestimate model performance
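To make the first two findings concrete, the sketch below compares accuracy on an original and a shuffled version of the same questions and reports the relative drop. It is an illustration only: the function names `accuracy` and `relative_drop`, the relative-drop measure, and the toy labels are assumptions and are not taken from the paper.

```python
def accuracy(predictions, gold):
    """Fraction of predictions that match the gold labels."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("need at least one example")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


def relative_drop(original_acc, shuffled_acc):
    """Accuracy lost under shuffling, as a fraction of the original accuracy."""
    if original_acc == 0:
        raise ValueError("original accuracy is zero; relative drop is undefined")
    return (original_acc - shuffled_acc) / original_acc


# Toy data (hypothetical, not from the paper): four questions,
# answered once in the original order and once after shuffling.
gold_original = ["A", "C", "B", "D"]
preds_original = ["A", "C", "B", "A"]
gold_shuffled = ["B", "A", "D", "C"]
preds_shuffled = ["B", "C", "D", "A"]

acc_original = accuracy(preds_original, gold_original)
acc_shuffled = accuracy(preds_shuffled, gold_shuffled)
drop = relative_drop(acc_original, acc_shuffled)
print(f"original={acc_original:.2f} shuffled={acc_shuffled:.2f} relative drop={drop:.2%}")
```

Computing the drop per model is one simple way to compare how sensitive different models are to answer order.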
Abstract
As large language models (LLMs) have grown in prevalence, particular benchmarks have become essential for the evaluation of these models and for understanding model capabilities. Most commonly, we use test accuracy averaged across multiple subtasks in order to rank models on leaderboards, to determine which model is best for our purposes. In this paper, we investigate the robustness of the accuracy measurement on a widely used multiple choice question answering dataset, MMLU. When shuffling the answer label contents, we find that all explored models decrease in accuracy on MMLU, but not every model is equally sensitive. These findings suggest a possible adjustment to the standard practice of leaderboard testing, where we additionally consider the percentage of examples each model answers correctly by random chance.
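One way to read "shuffling the answer label contents" is that the labels A to D stay in their usual order while the answer texts attached to them are permuted. The sketch below implements that reading. The helper names `shuffle_answer_contents` and `format_prompt`, the prompt layout, and the example question are hypothetical and may differ from the authors' actual setup.

```python
import random

LABELS = ["A", "B", "C", "D"]


def shuffle_answer_contents(choices, answer_index, rng):
    """Permute the answer texts while the labels keep their A-D order.

    Returns the reordered texts and the new index of the correct answer.
    """
    order = list(range(len(choices)))
    rng.shuffle(order)
    new_choices = [choices[i] for i in order]
    # The correct text now sits wherever its old index landed in `order`.
    new_answer_index = order.index(answer_index)
    return new_choices, new_answer_index


def format_prompt(question, choices):
    """Render a question in a simple labelled multiple-choice layout (assumed)."""
    lines = [question]
    lines += [f"{label}. {text}" for label, text in zip(LABELS, choices)]
    lines.append("Answer:")
    return "\n".join(lines)


rng = random.Random(0)  # fixed seed so the shuffle is reproducible
choices = ["Paris", "London", "Rome", "Berlin"]
new_choices, new_idx = shuffle_answer_contents(choices, 0, rng)
print(format_prompt("What is the capital of France?", new_choices))
print("Correct label after shuffling:", LABELS[new_idx])
```

Tracking the new index of the correct answer matters: scoring the shuffled prompt against the original gold label would understate accuracy for reasons unrelated to the model.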
