WiCkeD: A Simple Method to Make Multiple Choice Benchmarks More Challenging

Ahmed Elhady; Eneko Agirre; Mikel Artetxe

arXiv:2502.18316·cs.CL·February 26, 2025

TL;DR

WiCkeD is a simple, automatic method that makes multiple-choice benchmarks harder by randomly replacing one answer choice with "None of the above". Applied to 6 popular benchmarks and 18 open-weight LLMs, it lowers performance by 12.1 points on average.

Contribution

The paper introduces WiCkeD, a technique that can be applied automatically to any existing multiple-choice benchmark to make it more challenging and to expose how sensitive individual models are to the extra reasoning required.

Findings

01

Across 6 benchmarks and 18 open-weight LLMs, model performance drops by 12.1 points on average relative to the original datasets.

02

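With chain-of-thought on 3 MMLU datasets, the performance drop on the WiCkeD variant is similar to the drop observed when the LLMs are used directly, so WiCkeD remains challenging for models with enhanced reasoning abilities.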

03

WiCkeD shows that some models are more sensitive than others to the extra reasoning required, information the original benchmarks do not provide.

Abstract

We introduce WiCkeD, a simple method to increase the complexity of existing multiple-choice benchmarks by randomly replacing a choice with "None of the above", a method often used in educational tests. We show that WiCkeD can be automatically applied to any existing benchmark, making it more challenging. We apply WiCkeD to 6 popular benchmarks and use it to evaluate 18 open-weight LLMs. The performance of the models drops 12.1 points on average with respect to the original versions of the datasets. When using chain-of-thought on 3 MMLU datasets, the performance drop for the WiCkeD variant is similar to the one observed when using the LLMs directly, showing that WiCkeD is also challenging for models with enhanced reasoning abilities. WiCkeD also uncovers that some models are more sensitive to the extra reasoning required, providing additional information with respect to the original…
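The core transformation described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `wicked_variant`, the choice to place "None of the above" as the last option, and the uniform sampling of the replaced choice are all assumptions made here for clarity. The key point it shows is that when the replaced choice happens to be the correct one, "None of the above" becomes the gold answer.

```python
import random

NONE_OPTION = "None of the above"


def wicked_variant(choices, answer_idx, rng):
    """Build a harder variant of one multiple-choice item.

    Hypothetical helper (not from the paper's code): one choice is picked
    uniformly at random, removed, and "None of the above" is appended as the
    last option. Returns (new_choices, new_answer_idx).
    """
    drop = rng.randrange(len(choices))  # index of the choice to replace
    kept = [c for i, c in enumerate(choices) if i != drop]
    new_choices = kept + [NONE_OPTION]

    if drop == answer_idx:
        # The correct choice was removed, so "None of the above" is now correct.
        new_answer_idx = len(new_choices) - 1
    else:
        # The correct choice survives; shift its index if an earlier choice was dropped.
        new_answer_idx = answer_idx - (1 if drop < answer_idx else 0)
    return new_choices, new_answer_idx


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the variant is reproducible
    options = ["Paris", "Rome", "Madrid", "Berlin"]
    new_options, gold = wicked_variant(options, answer_idx=0, rng=rng)
    print(new_options, "gold:", new_options[gold])
```

The number of options stays the same, so the variant can be scored with the same accuracy metric as the original benchmark.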

Code & Models

Repositories

ahmedselhady/wicked-benchmarks
Official

Models

ahmedselhady/bert-base-uncased-sba-clf
