Garbage In, Reasoning Out? Why Benchmark Scores are Unreliable and What to Do About It

Seyed Mahed Mousavi; Edoardo Cecchinato; Lucia Hornikova; Giuseppe Riccardi

arXiv:2506.23864·cs.CL·July 1, 2025

TL;DR

This paper critically examines three reasoning benchmarks, revealing widespread flaws and demonstrating that current scores often reflect superficial cues rather than genuine reasoning, thus questioning the validity of benchmark-based reasoning claims in LLMs.

Contribution

The study systematically audits reasoning benchmarks, uncovers design flaws, and proposes improved evaluation protocols emphasizing reasoning as inference rather than output formatting.

Findings

01. Benchmark items often contain duplicated and ambiguous questions.

02. Model scores are highly sensitive to minor input variations.

03. Current evaluation methods may overestimate reasoning abilities.

Abstract

We conduct a systematic audit of three widely used reasoning benchmarks, SocialIQa, FauxPas-EAI, and ToMi, and uncover pervasive flaws in both benchmark items and evaluation methodology. Using five LLMs (GPT-{3, 3.5, 4, o1}, and LLaMA 3.1) as diagnostic tools, we identify structural, semantic, and pragmatic issues in benchmark design (e.g., duplicated items, ambiguous wording, and implausible answers), as well as scoring procedures that prioritize output form over reasoning process. Through systematic human annotation and re-evaluation on cleaned benchmark subsets, we find that model scores often improve due to erratic surface wording variations and not due to improved reasoning. In fact, further analyses show that model performance is highly sensitive to minor input variations such as context availability and phrasing, revealing that high scores may reflect alignment with…
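
One of the structural issues named in the abstract, duplicated items, can be illustrated with a small check. The following is only a minimal sketch, not the authors' audit procedure: the `context`/`question` field names, the normalization rules, and the sample items are all assumptions made for illustration.

```python
import re
from collections import defaultdict


def normalize(text):
    """Lowercase, drop punctuation, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())


def find_duplicates(items):
    """Group item indices whose normalized context and question match.

    `items` is a list of dicts with hypothetical "context" and
    "question" keys; real benchmarks may use different field names.
    """
    groups = defaultdict(list)
    for idx, item in enumerate(items):
        key = (normalize(item["context"]), normalize(item["question"]))
        groups[key].append(idx)
    # Keep only groups with more than one member, i.e. duplicates.
    return [idxs for idxs in groups.values() if len(idxs) > 1]


# Made-up items for illustration only; not taken from any benchmark.
items = [
    {"context": "Alex spilled coffee on Sam's notes.",
     "question": "How would Sam feel afterwards?"},
    {"context": "Jordan left the room while Kai moved the ball.",
     "question": "Where will Jordan look for the ball?"},
    {"context": "Alex spilled coffee on Sam's notes",
     "question": "how would Sam feel afterwards ?"},
]

print(find_duplicates(items))  # [[0, 2]]
```

Exact matching after normalization only catches near-verbatim repeats; paraphrased duplicates or ambiguous wording would still need the kind of human annotation the abstract describes.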

Taxonomy

Topics: Topic Modeling · Intelligent Tutoring Systems and Adaptive Learning · Computational and Text Analysis Methods