Garbage In, Reasoning Out? Why Benchmark Scores are Unreliable and What to Do About It
Seyed Mahed Mousavi, Edoardo Cecchinato, Lucia Hornikova, Giuseppe Riccardi

TL;DR
This paper critically examines three reasoning benchmarks, revealing widespread flaws and demonstrating that current scores often reflect superficial cues rather than genuine reasoning, thus questioning the validity of benchmark-based reasoning claims in LLMs.
Contribution
The study systematically audits reasoning benchmarks, uncovers design flaws, and proposes improved evaluation protocols emphasizing reasoning as inference rather than output formatting.
Findings
Benchmark items often contain duplicated and ambiguous questions.
Model scores are highly sensitive to minor input variations.
Current evaluation methods may overestimate reasoning abilities.
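As a rough illustration of the first finding, the sketch below flags benchmark items that become identical once case, punctuation, and spacing are ignored. It is not the authors' audit procedure: the `context` and `question` field names, the `normalize` and `find_duplicates` helpers, and the sample items are all hypothetical, chosen only to show the idea.

```python
import re
from collections import defaultdict


def normalize(text):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def find_duplicates(items):
    """Return groups of item indices whose normalized context and question match.

    `items` is assumed to be a list of dicts with hypothetical
    "context" and "question" keys; real benchmarks use their own schemas.
    """
    groups = defaultdict(list)
    for idx, item in enumerate(items):
        key = (normalize(item["context"]), normalize(item["question"]))
        groups[key].append(idx)
    # Keep only keys shared by more than one item.
    return [idxs for idxs in groups.values() if len(idxs) > 1]


# Made-up items for illustration only (not taken from any benchmark).
items = [
    {"context": "Sally put the ball in the basket.",
     "question": "Where will Sally look for the ball?"},
    {"context": "sally put the ball in the basket",
     "question": "Where will Sally look for the ball ?"},
    {"context": "Anne moved the ball to the box.",
     "question": "Where is the ball now?"},
]

print(find_duplicates(items))
```

Exact matching after normalization only catches near-verbatim duplicates; paraphrased or semantically equivalent items would need human annotation or a similarity measure.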
Abstract
We conduct a systematic audit of three widely used reasoning benchmarks, SocialIQa, FauxPas-EAI, and ToMi, and uncover pervasive flaws in both benchmark items and evaluation methodology. Using five LLMs (GPT-{3, 3.5, 4, o1}, and LLaMA 3.1) as diagnostic tools, we identify structural, semantic, and pragmatic issues in benchmark design (e.g., duplicated items, ambiguous wording, and implausible answers), as well as scoring procedures that prioritize output form over reasoning process. Through systematic human annotation and re-evaluation on cleaned benchmark subsets, we find that model scores often improve due to erratic surface wording variations rather than improved reasoning. In fact, further analyses show that model performance is highly sensitive to minor input variations such as context availability and phrasing, revealing that high scores may reflect alignment with…
Taxonomy
Topics: Topic Modeling · Intelligent Tutoring Systems and Adaptive Learning · Computational and Text Analysis Methods
