Are Your LLMs Capable of Stable Reasoning?

Junnan Liu; Hongwei Liu; Linchen Xiao; Ziyi Wang; Kuikun Liu; Songyang Gao; Wenwei Zhang; Songyang Zhang; Kai Chen

arXiv:2412.13147 · cs.AI · August 11, 2025

TL;DR

This paper introduces G-Pass@$k$, a new evaluation metric for large language models that measures both their reasoning accuracy and stability across multiple attempts, highlighting gaps in current evaluation methods.

Contribution

The paper proposes G-Pass@$k$, a novel metric for assessing LLM reasoning performance and stability, and demonstrates its effectiveness through extensive experiments.

Findings

01  G-Pass@$k$ provides a more comprehensive evaluation of LLM reasoning capabilities.

02  Current benchmarks may overestimate LLM performance without stability considerations.

03  Enhanced evaluation metrics can guide better development of robust LLMs.

Abstract

The rapid advancement of large language models (LLMs) has shown remarkable progress in complex reasoning tasks. However, a significant disparity exists between benchmark performances and real-world applications. We attribute this gap primarily to current evaluation protocols and metrics, which inadequately capture the full spectrum of LLM capabilities, especially in complex reasoning tasks where both accuracy and consistency are essential. In this paper, we introduce G-Pass@$k$, a novel evaluation metric that continuously assesses model performance across multiple sampling attempts, quantifying both the model's performance potential and its stability. Through extensive experiments on various public and newly constructed benchmarks, we employ G-Pass@$k$ in conjunction with state-of-the-art large language models to provide comprehensive insights into their potential capabilities and…
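The abstract describes G-Pass@$k$ only in words, so the following is a minimal sketch rather than the authors' implementation. It assumes the hypergeometric formulation commonly given for the metric: with $n$ samples per question of which $c$ are correct, G-Pass@$k_\tau$ is the probability that at least $\lceil \tau k \rceil$ of $k$ samples drawn without replacement are correct, averaged over questions. The threshold symbol `tau` and the function names `g_pass_at_k` and `mean_g_pass_at_k` are illustrative and do not appear on this page.

```python
from math import ceil, comb


def g_pass_at_k(n: int, c: int, k: int, tau: float) -> float:
    """Probability that at least ceil(tau * k) of k draws (without
    replacement) from n samples, c of them correct, are correct.

    Assumed formulation; not taken from this page.
    """
    if not 0 < k <= n:
        raise ValueError("need 0 < k <= n")
    if not 0 <= c <= n:
        raise ValueError("need 0 <= c <= n")
    if not 0 < tau <= 1:
        raise ValueError("need 0 < tau <= 1")

    # Small epsilon guards against float noise such as 0.7 * 10 = 7.000000000000001.
    j_min = max(1, ceil(tau * k - 1e-9))

    # Hypergeometric tail: sum over j = j_min..k of C(c, j) * C(n - c, k - j).
    # math.comb returns 0 when the lower index exceeds the upper one,
    # so impossible terms (j > c or k - j > n - c) vanish on their own.
    favourable = sum(comb(c, j) * comb(n - c, k - j) for j in range(j_min, k + 1))
    return favourable / comb(n, k)


def mean_g_pass_at_k(results: list[tuple[int, int]], k: int, tau: float) -> float:
    """Average the per-question score over (n, c) pairs, one pair per question."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(g_pass_at_k(n, c, k, tau) for n, c in results) / len(results)
```

Under this assumption, a question with 2 correct answers out of 4 samples scores `g_pass_at_k(4, 2, 2, 1.0) == 1/6` when both drawn samples must be correct, but `g_pass_at_k(4, 2, 2, 0.5) == 5/6` when one correct sample suffices. The gap between the two values is the kind of instability the metric is meant to surface.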

Code & Models

Repositories

open-compass/gpassk
PyTorch · Official

Models

🤗
jnanliu/LiveMath-Judge
model · 3 dl · ♡ 1

Datasets

opencompass/LiveMathBench
dataset · 871 dl

Taxonomy

Topics: Artificial Intelligence in Law