ReasonBENCH: Benchmarking the (In)Stability of LLM Reasoning

Nearchos Potamitis; Lars Klein; Akhil Arora

arXiv:2512.07795 · cs.AI · December 9, 2025

Open Access

TL;DR

ReasonBENCH introduces a comprehensive benchmark to evaluate the stability and reproducibility of large language model reasoning, addressing the overlooked variability in performance across multiple runs and conditions.

Contribution

ReasonBENCH provides a modular evaluation framework, a multi-run protocol, and a public leaderboard to quantify variance in LLM reasoning methods and to encourage variance-aware assessment.

Findings

01

Most reasoning strategies exhibit high instability.

02

Strategies with similar average performance can have confidence intervals that differ in width by up to a factor of four.

03

Top-performing methods often incur higher and less stable costs.
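
To make the second finding concrete, the sketch below compares two hypothetical methods whose mean accuracy is identical but whose run-to-run spread differs. The per-run accuracies are synthetic values invented for illustration, not results from the paper, and the normal-approximation interval is an assumption rather than the paper's stated procedure.

```python
import math
import statistics


def ci_half_width(samples, z=1.96):
    """Half-width of a normal-approximation 95% confidence interval for the mean."""
    return z * statistics.stdev(samples) / math.sqrt(len(samples))


# Synthetic per-run accuracies (invented for illustration only).
method_a = [0.70, 0.71, 0.69, 0.70, 0.70]  # tightly clustered runs
method_b = [0.74, 0.66, 0.70, 0.70, 0.70]  # same mean, wider spread

mean_a = statistics.mean(method_a)
mean_b = statistics.mean(method_b)
width_ratio = ci_half_width(method_b) / ci_half_width(method_a)

print(f"mean A = {mean_a:.3f}, mean B = {mean_b:.3f}")
print(f"CI of B is {width_ratio:.1f}x wider than CI of A")
```

A single-run report would rank these two methods as equivalent; only the interval width reveals that one of them is far less predictable.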

Abstract

Large language models (LLMs) are increasingly deployed in settings where reasoning, such as multi-step problem solving and chain-of-thought, is essential. Yet, current evaluation practices overwhelmingly report single-run accuracy while ignoring the intrinsic uncertainty that naturally arises from stochastic decoding. This omission creates a blind spot because practitioners cannot reliably assess whether a method's reported performance is stable, reproducible, or cost-consistent. We introduce ReasonBENCH, the first benchmark designed to quantify the underlying instability in LLM reasoning. ReasonBENCH provides (i) a modular evaluation library that standardizes reasoning frameworks, models, and tasks, (ii) a multi-run protocol that reports statistically reliable metrics for both quality and cost, and (iii) a public leaderboard to encourage variance-aware reporting. Across tasks from…
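
A multi-run protocol of the kind the abstract describes could be sketched as follows. This is a minimal illustration under stated assumptions, not the ReasonBENCH library: `run_once` is a hypothetical stand-in that simulates one stochastic benchmark run, and the percentile bootstrap is one common way to attach an interval to a mean, chosen here for illustration.

```python
import random
import statistics


def run_once(seed):
    """Hypothetical stand-in for one full run of a reasoning method.

    A real run would query a model with stochastic decoding; here the
    accuracy and dollar cost are simulated so the sketch is runnable.
    """
    rng = random.Random(seed)
    accuracy = min(1.0, max(0.0, rng.gauss(0.70, 0.03)))
    cost_usd = max(0.0, rng.gauss(1.50, 0.40))
    return accuracy, cost_usd


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    means = sorted(
        statistics.mean(rng.choices(values, k=len(values)))
        for _ in range(n_resamples)
    )
    low = means[int((alpha / 2) * n_resamples)]
    high = means[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high


def summarize(values):
    """Mean plus a bootstrap interval, reported together."""
    low, high = bootstrap_ci(values)
    return {"mean": statistics.mean(values), "ci_low": low, "ci_high": high}


def multi_run(n_runs=10):
    """Repeat the evaluation and summarize quality and cost separately."""
    results = [run_once(seed) for seed in range(n_runs)]
    accuracies = [acc for acc, _ in results]
    costs = [cost for _, cost in results]
    return {"quality": summarize(accuracies), "cost": summarize(costs)}


report = multi_run(n_runs=10)
for metric, stats in report.items():
    print(metric, {k: round(v, 3) for k, v in stats.items()})
```

Reporting cost with its own interval, rather than as a single number, is what lets a reader judge whether a method is cost-consistent as well as accurate.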


Taxonomy

Topics: Topic Modeling · Text Readability and Simplification · Explainable Artificial Intelligence (XAI)