Can LLM Reasoning Be Trusted? A Comparative Study: Using Human Benchmarking on Statistical Tasks
Crish Nagarkar, Leonid Bogachev, Serge Sharoff

TL;DR
This study evaluates the reasoning and self-assessment abilities of fine-tuned large language models on statistical tasks, comparing their performance to human benchmarks and highlighting their potential in education and research validation.
Contribution
It fine-tunes open-source LLMs for statistical reasoning and shows that fine-tuning improves their performance, and that LLM-based assessment of answer quality outperforms traditional evaluation metrics.
Findings
Fine-tuned models perform at a level comparable to statistics students.
Improvements from fine-tuning depend on model architecture, with some models showing significant gains.
LLMs can effectively assess answer quality, surpassing traditional evaluation metrics.
Abstract
This paper investigates the ability of large language models (LLMs) to solve statistical tasks, as well as their capacity to assess the quality of reasoning. While state-of-the-art LLMs have demonstrated remarkable performance in a range of NLP tasks, their competence in addressing even moderately complex statistical challenges is not well understood. We have fine-tuned selected open-source LLMs on a specially developed dataset to enhance their statistical reasoning capabilities, and compared their performance with the human scores used as a benchmark. Our results show that the fine-tuned models achieve better performance on advanced statistical tasks, at a level comparable to that of a statistics student. Fine-tuning demonstrates architecture-dependent improvements, with some models showing significant performance gains, indicating clear potential for deployment in educational technology and…
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Statistics Education and Methodologies · Topic Modeling
