Can LLM Reasoning Be Trusted? A Comparative Study: Using Human Benchmarking on Statistical Tasks

Crish Nagarkar; Leonid Bogachev; Serge Sharoff

arXiv:2601.14479 · cs.CL · January 22, 2026

Open Access

TL;DR

This study evaluates the reasoning and self-assessment abilities of fine-tuned large language models on statistical tasks, comparing their performance to human benchmarks and highlighting their potential in education and research validation.

Contribution

The paper fine-tunes open-source LLMs for statistical reasoning and shows that the fine-tuned models perform better on statistical tasks and that LLM-based assessment of answer quality surpasses traditional evaluation metrics.

Findings

01

Fine-tuned models perform at a level comparable to statistics students.

02

Improvements from fine-tuning depend on the model architecture, with some models showing significant gains.

03

LLMs can effectively assess answer quality, surpassing traditional evaluation metrics.

Abstract

This paper investigates the ability of large language models (LLMs) to solve statistical tasks, as well as their capacity to assess the quality of reasoning. While state-of-the-art LLMs have demonstrated remarkable performance in a range of NLP tasks, their competence in addressing even moderately complex statistical challenges is not well understood. We have fine-tuned selected open-source LLMs on a specially developed dataset to enhance their statistical reasoning capabilities, and compared their performance with human scores used as a benchmark. Our results show that the fine-tuned models achieve better performance on advanced statistical tasks, at a level comparable to that of a statistics student. Fine-tuning yields architecture-dependent improvements, with some models showing significant performance gains, indicating clear potential for deployment in educational technology and…

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Statistics Education and Methodologies · Topic Modeling