WiCkeD: A Simple Method to Make Multiple Choice Benchmarks More Challenging
Ahmed Elhady, Eneko Agirre, Mikel Artetxe

TL;DR
WiCkeD is a simple, automatic method that makes multiple-choice benchmarks harder by randomly replacing one answer choice with "None of the above", exposing model weaknesses that the original benchmarks do not reveal.
Contribution
The paper introduces WiCkeD, an easy-to-apply technique that can be applied automatically to any existing multiple-choice benchmark to make it more challenging and to show how sensitive each model is to the extra reasoning required.
Findings
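Across 18 open-weight LLMs and 6 popular benchmarks, performance drops by 12.1 points on average relative to the original datasets.
With chain-of-thought on 3 MMLU datasets, the drop on the WiCkeD variant is similar to the drop observed when the LLMs are used directly, so WiCkeD is also challenging for models with enhanced reasoning abilities.
Some models are more sensitive than others to the extra reasoning required, which the original benchmarks do not show.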
Abstract
We introduce WiCkeD, a simple method to increase the complexity of existing multiple-choice benchmarks by randomly replacing a choice with "None of the above", a method often used in educational tests. We show that WiCkeD can be automatically applied to any existing benchmark, making it more challenging. We apply WiCkeD to 6 popular benchmarks and use it to evaluate 18 open-weight LLMs. The performance of the models drops 12.1 points on average with respect to the original versions of the datasets. When using chain-of-thought on 3 MMLU datasets, the performance drop for the WiCkeD variant is similar to the one observed when using the LLMs directly, showing that WiCkeD is also challenging for models with enhanced reasoning abilities. WiCkeD also uncovers that some models are more sensitive to the extra reasoning required, providing additional information with respect to the original…
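The transformation described in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `wicked_variant`, the choice to append "None of the above" as the last option, and the rule that it becomes the gold answer whenever the removed choice was the correct one are all assumptions made for this sketch.

```python
import random
from typing import List, Optional, Tuple

NOTA = "None of the above"


def wicked_variant(
    options: List[str],
    answer_idx: int,
    rng: Optional[random.Random] = None,
) -> Tuple[List[str], int]:
    """Return (new_options, new_answer_idx) for one multiple-choice item.

    Hypothetical helper: one randomly chosen option is dropped and
    "None of the above" is appended in the last position (an assumption).
    """
    if not 0 <= answer_idx < len(options):
        raise ValueError("answer_idx must index into options")
    rng = rng or random.Random()

    # Pick the option to remove uniformly at random.
    drop = rng.randrange(len(options))
    new_options = [o for i, o in enumerate(options) if i != drop] + [NOTA]

    if drop == answer_idx:
        # The correct choice was removed, so NOTA is now the gold answer.
        new_answer = len(new_options) - 1
    else:
        # The correct choice survives; shift its index if an earlier option was dropped.
        new_answer = answer_idx - (1 if drop < answer_idx else 0)
    return new_options, new_answer


if __name__ == "__main__":
    opts = ["Paris", "London", "Rome", "Madrid"]
    new_opts, gold = wicked_variant(opts, answer_idx=0, rng=random.Random(0))
    print(new_opts, "gold:", new_opts[gold])
```

Under these assumptions the number of options stays the same, and a model can no longer answer correctly by only picking the most plausible listed choice: it also has to judge whether any listed choice is correct at all.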
