Pushing the Boundaries of Multiple Choice Evaluation to One Hundred Options

Nahyun Lee; Guijin Son

arXiv:2604.14634·cs.CL·April 17, 2026

TL;DR

This paper introduces a multiple-choice evaluation protocol with 100 options per question. The larger candidate set reduces the impact of chance performance and reveals language model weaknesses that traditional low-option tests hide.

Contribution

The paper proposes an evaluation protocol that scales the candidate set to 100 options and applies it to a Korean orthography error detection task, showing that it exposes model failures and biases that standard low-option benchmarks hide.

Findings

01

High performance in low-option settings can overstate model competence.

02

Dense distractors reveal gaps in model understanding and biases.

03

Candidate ranking, not context length, is the main bottleneck.

Abstract

Multiple-choice evaluation is widely used for benchmarking large language models, yet near-ceiling accuracy in low-option settings can be sustained by shortcut strategies that obscure true competence. We therefore propose a massive-option evaluation protocol that scales the candidate set to one hundred options and sharply reduces the impact of chance performance. We apply this framework to a Korean orthography error detection task in which models must pick the single incorrect sentence from a large candidate set. With fixed targets and repeated resampling and shuffling, we obtain stable estimates while separating content-driven failures from positional artifacts. Across experiments, results indicate that strong performance in low-option settings can overstate model competence. This apparent advantage often weakens under dense interference at high $N$, revealing gaps that conventional…
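The abstract's "fixed targets and repeated resampling and shuffling" can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `build_trial`, `evaluate`, `chance_accuracy`, and `pick_fn` are hypothetical, and the sketch assumes the target sentence does not also appear in the distractor pool. It keeps one target fixed, redraws the distractors and shuffles positions on every repeat, and tallies results by target position so that positional effects can be inspected separately from overall accuracy.

```python
import random
from collections import defaultdict


def chance_accuracy(n_options):
    """Accuracy of uniform random guessing among n_options candidates."""
    return 1.0 / n_options


def build_trial(target, distractor_pool, n_options, rng):
    """Build one trial: the fixed target plus freshly sampled distractors.

    Assumes `target` is not in `distractor_pool` (hypothetical setup).
    Returns the shuffled option list and the index of the target.
    """
    distractors = rng.sample(distractor_pool, n_options - 1)
    options = distractors + [target]
    rng.shuffle(options)  # randomize where the target lands
    return options, options.index(target)


def evaluate(target, distractor_pool, n_options, n_repeats, pick_fn, seed=0):
    """Repeat resampled, shuffled trials for a single fixed target.

    `pick_fn` stands in for a model: it takes the option list and
    returns the index it selects. Returns overall accuracy and a dict
    mapping target position -> [hits, trials].
    """
    rng = random.Random(seed)
    hits = 0
    by_position = defaultdict(lambda: [0, 0])
    for _ in range(n_repeats):
        options, gold = build_trial(target, distractor_pool, n_options, rng)
        correct = int(pick_fn(options) == gold)
        hits += correct
        by_position[gold][0] += correct
        by_position[gold][1] += 1
    return hits / n_repeats, dict(by_position)
```

Under these assumptions, `chance_accuracy(4)` is 0.25 while `chance_accuracy(100)` is 0.01, which is the sense in which one hundred options reduce the contribution of guessing. A picker that always selects the first option would score well only on trials where the target happens to land at position 0, which the per-position tally makes visible.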
