EvalQReason: A Framework for Step-Level Reasoning Evaluation in Large Language Models
Shaima Ahmad Freja, Ferhat Ozgur Catak, Betul Yurdem, and Chunming Rong

TL;DR
EvalQReason introduces a step-level reasoning evaluation framework for LLMs that analyzes reasoning dynamics without human labels, revealing domain-specific differences and improving correctness prediction.
Contribution
The paper presents EvalQReason, a framework that quantifies LLM reasoning quality through step-level probability distribution analysis. It contributes two algorithms, Consecutive Step Divergence (CSD) and Step-to-Final Convergence (SFC), each summarized by five statistical metrics.
Findings
CSD features achieve high accuracy in correctness classification
Sequential models outperform classical machine learning approaches
Mathematical reasoning shows clear divergence patterns, medical reasoning less so
Abstract
Large Language Models (LLMs) are increasingly deployed in critical applications requiring reliable reasoning, yet their internal reasoning processes remain difficult to evaluate systematically. Existing methods focus on final-answer correctness, providing limited insight into how reasoning unfolds across intermediate steps. We present EvalQReason, a framework that quantifies LLM reasoning quality through step-level probability distribution analysis without requiring human annotation. The framework introduces two complementary algorithms: Consecutive Step Divergence (CSD), which measures local coherence between adjacent reasoning steps, and Step-to-Final Convergence (SFC), which assesses global alignment with final answers. Each algorithm employs five statistical metrics to capture reasoning dynamics. Experiments across mathematical and medical datasets with open-source 7B-parameter…
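To make the distinction between CSD and SFC concrete, here is a minimal sketch of the two comparisons the abstract describes. It is an illustration under stated assumptions, not the authors' implementation: the summary does not name the five statistical metrics or say how token-level probabilities are aggregated into a step-level distribution, so the sketch assumes each reasoning step is already represented by one discrete probability distribution and uses Jensen-Shannon divergence as a stand-in metric. The function names and toy data are hypothetical.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two discrete distributions.

    Stand-in metric chosen for illustration; the paper's own metrics may differ.
    """
    def kl(a, b):
        # Terms with a == 0 contribute nothing; b (the mixture) is > 0 wherever a > 0.
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def consecutive_step_divergence(steps):
    """CSD-style local view: divergence between each pair of adjacent steps."""
    return [js_divergence(a, b) for a, b in zip(steps, steps[1:])]


def step_to_final_convergence(steps, final):
    """SFC-style global view: divergence between each step and the final answer."""
    return [js_divergence(s, final) for s in steps]


# Toy data (hypothetical): one distribution per reasoning step over a 3-symbol vocabulary.
steps = [
    [0.6, 0.3, 0.1],
    [0.5, 0.3, 0.2],
    [0.1, 0.2, 0.7],
]
final = [0.1, 0.1, 0.8]

csd = consecutive_step_divergence(steps)       # one value per adjacent pair
sfc = step_to_final_convergence(steps, final)  # one value per step
```

In this toy trace the second CSD value is larger than the first because the distribution shifts sharply between steps two and three, while the SFC values shrink as the steps approach the final-answer distribution. Sequences like these are the kind of per-step features a downstream correctness classifier could consume.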
Taxonomy
Topics: Artificial Intelligence in Healthcare and Education · Topic Modeling · Explainable Artificial Intelligence (XAI)
