Changing Answer Order Can Decrease MMLU Accuracy

Vipul Gupta; David Pantoja; Candace Ross; Adina Williams; Megan Ung

arXiv:2406.19470 · cs.CL · November 12, 2024 · 1 citation

TL;DR

Shuffling the contents of the answer options in MMLU's multiple choice questions lowers accuracy for every model the paper tests, though not by the same amount for each, which raises questions about how robust current leaderboard accuracy scores are.

Contribution

The study shows that answer order affects LLM accuracy on MMLU and suggests that leaderboard evaluation should also consider the percentage of examples each model answers correctly by random chance.

Findings

01. Answer shuffling decreases model accuracy on MMLU

02. Models vary in sensitivity to answer order changes

03. Current evaluation methods may overestimate model performance

Abstract

As large language models (LLMs) have grown in prevalence, particular benchmarks have become essential for the evaluation of these models and for understanding model capabilities. Most commonly, we use test accuracy averaged across multiple subtasks in order to rank models on leaderboards, to determine which model is best for our purposes. In this paper, we investigate the robustness of the accuracy measurement on a widely used multiple choice question answering dataset, MMLU. When shuffling the answer label contents, we find that all explored models decrease in accuracy on MMLU, but not every model is equally sensitive. These findings suggest a possible adjustment to the standard practice of leaderboard testing, where we additionally consider the percentage of examples each model answers correctly by random chance.
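The shuffling setup described in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: the function names, the prompt layout, the toy `EXAMPLES`, and the position-biased `always_a` stub are all hypothetical. The labels A to D stay in place and only the answer contents move, so the index of the correct answer has to be tracked through the shuffle.

```python
import random

LABELS = ["A", "B", "C", "D"]


def shuffle_choices(choices, answer_idx, rng):
    """Permute the answer contents; return them with the new index of the correct answer."""
    order = list(range(len(choices)))
    rng.shuffle(order)
    shuffled = [choices[i] for i in order]
    # The correct answer now sits wherever its original index landed.
    return shuffled, order.index(answer_idx)


def format_prompt(question, choices):
    """Assumed prompt layout: question, labelled options, then 'Answer:'."""
    lines = [question]
    lines += [f"{label}. {text}" for label, text in zip(LABELS, choices)]
    lines.append("Answer:")
    return "\n".join(lines)


def accuracy(predict, examples, rng=None):
    """Fraction answered correctly; if rng is given, shuffle each example's choices first."""
    correct = 0
    for question, choices, answer_idx in examples:
        if rng is not None:
            choices, answer_idx = shuffle_choices(choices, answer_idx, rng)
        prediction = predict(format_prompt(question, choices))
        correct += prediction == LABELS[answer_idx]
    return correct / len(examples)


def always_a(prompt):
    """Hypothetical stand-in for a model with an extreme position bias."""
    return "A"


# Toy data (not from MMLU): the correct answer is always listed first.
EXAMPLES = [
    ("What is 2 + 2?", ["4", "3", "5", "22"], 0),
    ("Which planet is closest to the Sun?", ["Mercury", "Venus", "Earth", "Mars"], 0),
    ("What is the chemical symbol for water?", ["H2O", "CO2", "NaCl", "O2"], 0),
    ("How many days are in a week?", ["7", "5", "6", "8"], 0),
]

if __name__ == "__main__":
    print("original order:", accuracy(always_a, EXAMPLES))
    print("shuffled order:", accuracy(always_a, EXAMPLES, random.Random(0)))
```

In this toy setting the stub scores perfectly on the original order only because the correct answer happens to sit under label A, which is the kind of order-dependent correctness the shuffled evaluation is meant to expose. Repeating the shuffled run over several seeds would give a rough sense of how much of a score depends on answer position.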
