StatLLM: A Dataset for Evaluating the Performance of Large Language Models in Statistical Analysis

Xinyi Song; Lina Lee; Kexin Xie; Xueying Liu; Xinwei Deng; Yili Hong

arXiv:2502.17657 · stat.AP · February 26, 2025

TL;DR

This paper introduces StatLLM, an open-source dataset for evaluating how accurately large language models generate statistical analysis code, addressing the absence of a benchmark dataset for statistical languages such as SAS and R.

Contribution

The paper presents a benchmark dataset for assessing LLMs in statistical coding. It comprises three components: statistical analysis tasks, LLM-generated SAS code, and human evaluation scores.

Findings

01

StatLLM enables systematic evaluation of LLMs in statistical coding.

02

Human evaluations highlight strengths and weaknesses of LLM-generated code.

03

The dataset supports development of better NLP metrics for code assessment.
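
As a rough illustration of what a simple text-overlap metric for code assessment might look like, the sketch below computes a token-level Jaccard similarity between a generated snippet and a reference snippet. The metric choice, the function names (`tokenize`, `jaccard_similarity`), and the SAS snippets (including the `work.sales` table and the `revenue` variable) are assumptions made for illustration; they are not taken from the paper or the dataset.

```python
import re


def tokenize(code: str) -> set:
    """Return the set of lowercase identifier and integer tokens in a code string."""
    return set(re.findall(r"[a-z_][a-z_0-9]*|\d+", code.lower()))


def jaccard_similarity(generated: str, reference: str) -> float:
    """Token-level Jaccard similarity: |A & B| / |A | B|, in [0, 1]."""
    a, b = tokenize(generated), tokenize(reference)
    if not a and not b:
        # Two empty snippets are treated as identical by convention.
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical snippets: a reference solution and a generated attempt
# that omits the standard deviation request.
reference_code = "proc means data=work.sales mean std; var revenue; run;"
generated_code = "proc means data=work.sales mean; var revenue; run;"

score = jaccard_similarity(generated_code, reference_code)
print(f"Jaccard similarity: {score:.2f}")
```

A score of this kind could then be compared against the human evaluation scores to check how closely a candidate metric tracks expert judgment. Token overlap ignores ordering and semantics, so it is only a baseline.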

Abstract

The coding capabilities of large language models (LLMs) have opened up new opportunities for automatic statistical analysis in machine learning and data science. However, before their widespread adoption, it is crucial to assess the accuracy of code generated by LLMs. A major challenge in this evaluation lies in the absence of a benchmark dataset for statistical code (e.g., SAS and R). To fill in this gap, this paper introduces StatLLM, an open-source dataset for evaluating the performance of LLMs in statistical analysis. The StatLLM dataset comprises three key components: statistical analysis tasks, LLM-generated SAS code, and human evaluation scores. The first component includes statistical analysis tasks spanning a variety of analyses and datasets, providing problem descriptions, dataset details, and human-verified SAS code. The second component features SAS code generated by ChatGPT…

Code & Models

Repositories

yili-hong/StatLLM
Official

Taxonomy

Topics: Computational and Text Analysis Methods · Topic Modeling

Methods: LLaMA