ReasonBENCH: Benchmarking the (In)Stability of LLM Reasoning
Nearchos Potamitis, Lars Klein, Akhil Arora

TL;DR
ReasonBENCH is a benchmark for evaluating the stability and reproducibility of large language model reasoning. It targets the variability in performance across repeated runs and conditions that single-run evaluations overlook.
Contribution
It provides a modular evaluation framework, multi-run protocols, and a public leaderboard to quantify and encourage variance-aware assessment of LLM reasoning methods.
Findings
Most reasoning strategies exhibit high instability.
Methods with similar average performance can have confidence intervals that differ in width by up to a factor of four.
Top-performing methods often incur higher and less stable costs.
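To make the confidence-interval finding concrete, here is a minimal sketch of how per-run scores could be summarized into a mean and an interval. It is not ReasonBENCH's implementation: the `RunSummary` and `summarize_runs` names, the normal-approximation interval, and the accuracy values are all hypothetical, chosen only to show two methods with the same mean but very different interval widths. The same function could be applied to per-run cost figures.

```python
import math
import statistics
from typing import NamedTuple, Sequence


class RunSummary(NamedTuple):
    """Mean and confidence interval over repeated runs (hypothetical helper)."""
    mean: float
    ci_low: float
    ci_high: float

    @property
    def ci_width(self) -> float:
        return self.ci_high - self.ci_low


def summarize_runs(values: Sequence[float], confidence: float = 0.95) -> RunSummary:
    """Summarize per-run metrics with a normal-approximation interval.

    Assumption: a z-based interval is used for simplicity. With very few
    runs, a t-based or bootstrap interval would be more appropriate.
    """
    if len(values) < 2:
        raise ValueError("need at least two runs to estimate variability")
    mean = statistics.fmean(values)
    # Standard error of the mean from the sample standard deviation.
    sem = statistics.stdev(values) / math.sqrt(len(values))
    # Two-sided critical value of the standard normal distribution.
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return RunSummary(mean, mean - z * sem, mean + z * sem)


# Made-up accuracies from five runs of two hypothetical methods.
method_a = summarize_runs([0.70, 0.72, 0.71, 0.69, 0.73])
method_b = summarize_runs([0.62, 0.80, 0.71, 0.66, 0.76])

for name, summary in (("A", method_a), ("B", method_b)):
    print(f"method {name}: mean={summary.mean:.3f}, CI width={summary.ci_width:.3f}")
```

A single-run report would treat these two methods as equivalent, while the interval widths show that one is far less predictable than the other.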
Abstract
Large language models (LLMs) are increasingly deployed in settings where reasoning, such as multi-step problem solving and chain-of-thought, is essential. Yet, current evaluation practices overwhelmingly report single-run accuracy while ignoring the intrinsic uncertainty that naturally arises from stochastic decoding. This omission creates a blind spot because practitioners cannot reliably assess whether a method's reported performance is stable, reproducible, or cost-consistent. We introduce ReasonBENCH, the first benchmark designed to quantify the underlying instability in LLM reasoning. ReasonBENCH provides (i) a modular evaluation library that standardizes reasoning frameworks, models, and tasks, (ii) a multi-run protocol that reports statistically reliable metrics for both quality and cost, and (iii) a public leaderboard to encourage variance-aware reporting. Across tasks from…
Taxonomy
Topics: Topic Modeling · Text Readability and Simplification · Explainable Artificial Intelligence (XAI)
