EvalQReason: A Framework for Step-Level Reasoning Evaluation in Large Language Models

Shaima Ahmad Freja, Ferhat Ozgur Catak, Betul Yurdem, and Chunming Rong

arXiv:2602.02295 · cs.LG · February 3, 2026

Open Access

TL;DR

EvalQReason introduces a step-level reasoning evaluation framework for LLMs that analyzes reasoning dynamics without human labels, revealing domain-specific differences and improving correctness prediction.

Contribution

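The paper presents EvalQReason, a framework that quantifies LLM reasoning quality through step-level probability distribution analysis without human annotation. It introduces two algorithms: Consecutive Step Divergence (CSD) for local coherence between adjacent steps, and Step-to-Final Convergence (SFC) for global alignment with the final answer.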

Findings

01. CSD features achieve high accuracy in correctness classification.

02. Sequential models outperform classical machine learning approaches.

03. Mathematical reasoning shows clear divergence patterns; medical reasoning shows them less clearly.

Abstract

Large Language Models (LLMs) are increasingly deployed in critical applications requiring reliable reasoning, yet their internal reasoning processes remain difficult to evaluate systematically. Existing methods focus on final-answer correctness, providing limited insight into how reasoning unfolds across intermediate steps. We present EvalQReason, a framework that quantifies LLM reasoning quality through step-level probability distribution analysis without requiring human annotation. The framework introduces two complementary algorithms: Consecutive Step Divergence (CSD), which measures local coherence between adjacent reasoning steps, and Step-to-Final Convergence (SFC), which assesses global alignment with final answers. Each algorithm employs five statistical metrics to capture reasoning dynamics. Experiments across mathematical and medical datasets with open-source 7B-parameter…
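The two algorithms named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not name the five statistical metrics, so Jensen-Shannon divergence is used here purely as an assumed stand-in, and the function names and the idea of representing each reasoning step as a single probability vector over a shared vocabulary are hypothetical choices made for illustration.

```python
import math
from typing import List, Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence: symmetric, bounded above by ln(2)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def consecutive_step_divergence(steps: Sequence[Sequence[float]]) -> List[float]:
    """CSD-style signal (assumed form): divergence between adjacent steps."""
    return [js_divergence(steps[i], steps[i + 1]) for i in range(len(steps) - 1)]


def step_to_final_convergence(
    steps: Sequence[Sequence[float]], final: Sequence[float]
) -> List[float]:
    """SFC-style signal (assumed form): divergence of each step from the final answer."""
    return [js_divergence(step, final) for step in steps]


# Toy example: three reasoning steps, each summarized as a distribution
# over a three-token vocabulary (hypothetical data, not from the paper).
steps = [
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.1, 0.8],
]
final_answer = [0.1, 0.1, 0.8]

csd = consecutive_step_divergence(steps)              # one value per adjacent pair
sfc = step_to_final_convergence(steps, final_answer)  # one value per step
```

In this toy input the second CSD value is larger than the first, because the last step shifts probability mass sharply, and the SFC values shrink toward zero as the steps approach the final-answer distribution. Sequences like these could then serve as per-step features for a downstream correctness classifier, which is the role the findings attribute to CSD features.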

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Topic Modeling · Explainable Artificial Intelligence (XAI)