BrainBench: Exposing the Commonsense Reasoning Gap in Large Language Models

Yuzhe Tang

arXiv:2603.14761 · cs.AI · March 18, 2026

TL;DR

BrainBench is a benchmark of 100 brainteaser questions across 20 categories, designed to expose commonsense reasoning failures in large language models. The best of the eight frontier models evaluated reaches only 80.3% accuracy on questions that humans answer correctly in seconds.

Contribution

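The paper introduces BrainBench, a diagnostic benchmark whose 20 categories each target a specific commonsense reasoning failure mode in LLMs, and evaluates eight frontier models (four from the Claude family, four from the GPT family) zero-shot with 10 independent runs per question.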

Findings

01

The top model, Claude Opus 4.6 with extended thinking, achieves only 80.3% accuracy; the weakest, GPT-4o, scores 39.7%

02

Even top-performing models show a 6-16 percentage-point gap between accuracy and consistency

03

Performance drops slightly in Chinese-language evaluations

Abstract

Large language models (LLMs) achieve impressive scores on standard benchmarks yet routinely fail questions that any human would answer correctly in seconds. We introduce BrainBench, a benchmark of 100 brainteaser questions spanning 20 carefully designed categories, each targeting a specific commonsense reasoning failure mode in LLMs. Categories range from implicit physical constraints ("Should I walk or drive my rental car to the return lot?") to semantic scope tricks and default assumption hijacks. We evaluate eight frontier models -- four from the Claude family and four from the GPT family -- using a zero-shot protocol with 10 independent runs per question. The best model, Claude Opus 4.6 with extended thinking, achieves only 80.3% accuracy; the worst, GPT-4o, scores 39.7%. Even top-performing models exhibit a 6-16 percentage-point gap between accuracy and consistency, revealing…
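
The accuracy-versus-consistency gap can be illustrated with a minimal sketch. The abstract as shown here is truncated and does not define consistency, so this assumes one plausible reading: accuracy is the fraction of correct answers pooled over all runs, and consistency is the fraction of questions answered correctly on every run. The function names, the question ids, and the toy results below are hypothetical and are not taken from the paper.

```python
from typing import Dict, List


def accuracy(results: Dict[str, List[bool]]) -> float:
    """Fraction of correct answers, pooled over all questions and runs."""
    total_runs = sum(len(runs) for runs in results.values())
    correct_runs = sum(sum(runs) for runs in results.values())
    return correct_runs / total_runs


def consistency(results: Dict[str, List[bool]]) -> float:
    """Fraction of questions answered correctly on every run.

    ASSUMPTION: this definition is a guess; the paper may define it differently.
    """
    fully_correct = sum(all(runs) for runs in results.values())
    return fully_correct / len(results)


# Toy data (hypothetical): 4 questions, 10 independent runs each.
results = {
    "q1": [True] * 10,               # always right
    "q2": [True] * 9 + [False],      # right 9 times out of 10
    "q3": [True] * 5 + [False] * 5,  # coin flip
    "q4": [False] * 10,              # always wrong
}

acc = accuracy(results)
con = consistency(results)
gap_points = (acc - con) * 100  # gap in percentage points
print(f"accuracy={acc:.3f} consistency={con:.3f} gap={gap_points:.1f} points")
```

Under this reading, a question that is answered correctly in 9 of 10 runs raises accuracy but contributes nothing to consistency, which is why the two numbers can diverge even for strong models.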

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Topic Modeling · Explainable Artificial Intelligence (XAI)