Plausibly Problematic Questions in Multiple-Choice Benchmarks for Commonsense Reasoning

Shramay Palta; Nishant Balepur; Peter Rankel; Sarah Wiegreffe; Marine Carpuat; Rachel Rudinger

arXiv:2410.10854 · cs.CL · October 16, 2024

TL;DR

This paper investigates the discrepancy between the most plausible answers and gold answers in commonsense MCQ benchmarks, revealing issues like ambiguity and semantic mismatch, and proposes plausibility judgments as a way to improve benchmark reliability.

Contribution

It introduces a method for collecting plausibility judgments for MCQ answers and demonstrates its effectiveness in identifying problematic questions in commonsense reasoning benchmarks.

Findings

01

For over 20% of the sampled questions, the answer choice rated most plausible differs from the benchmark's gold answer.

02

Manual inspection confirms that these mismatched questions show higher rates of problems such as ambiguity and semantic mismatch between the question and its answer choices.

03

Plausibility judgments can help identify more reliable benchmark items.

Abstract

Questions involving commonsense reasoning about everyday situations often admit many possible or plausible answers. In contrast, multiple-choice question (MCQ) benchmarks for commonsense reasoning require a hard selection of a single correct answer, which, in principle, should represent the most plausible answer choice. On 250 MCQ items sampled from two commonsense reasoning benchmarks, we collect 5,000 independent plausibility judgments on answer choices. We find that for over 20% of the sampled MCQs, the answer choice rated most plausible does not match the benchmark gold answers; upon manual inspection, we confirm that this subset exhibits higher rates of problems like ambiguity or semantic mismatch between question and answer choices. Experiments with LLMs reveal low accuracy and high variation in performance on the subset, suggesting our…
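The comparison described in the abstract, between the answer choice rated most plausible and the benchmark's gold answer, can be sketched roughly as follows. This is a minimal illustration and not the authors' code: the item layout, the field names (`id`, `gold`, `ratings`), the 1-to-5 rating scale, the use of the mean to aggregate ratings, and the toy data are all assumptions made for the example.

```python
from statistics import mean


def most_plausible(ratings):
    """Return the answer label with the highest mean plausibility rating.

    `ratings` maps each answer label to a list of annotator scores
    (hypothetical 1-5 scale). Ties go to the label listed first.
    """
    means = {label: mean(scores) for label, scores in ratings.items()}
    return max(means, key=means.get)


def flag_mismatches(items):
    """Return ids of items whose most plausible choice is not the gold label."""
    return [
        item["id"]
        for item in items
        if most_plausible(item["ratings"]) != item["gold"]
    ]


# Toy data, invented purely for illustration.
items = [
    {"id": "q1", "gold": "A",
     "ratings": {"A": [5, 4, 5], "B": [2, 1, 2], "C": [1, 1, 2]}},
    {"id": "q2", "gold": "B",
     "ratings": {"A": [4, 5, 4], "B": [3, 3, 4], "C": [1, 2, 1]}},
]

flagged = flag_mismatches(items)
print(flagged)
```

In this toy example only `q2` is flagged, because choice `A` has a higher mean rating than the gold choice `B`. Items flagged this way would be candidates for the kind of manual inspection the abstract describes.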

Code & Models

Repositories

shramay-palta/commonsense-mcq-plausibility (Official)

Videos

Plausibly Problematic Questions in Multiple-Choice Benchmarks for Commonsense Reasoning · Underline

Taxonomy

Topics: Logic, Reasoning, and Knowledge