StatLLM: A Dataset for Evaluating the Performance of Large Language Models in Statistical Analysis
Xinyi Song, Lina Lee, Kexin Xie, Xueying Liu, Xinwei Deng, Yili Hong

TL;DR
This paper introduces StatLLM, an open-source dataset for evaluating how accurately large language models generate statistical analysis code, addressing the lack of a benchmark dataset for statistical code such as SAS and R.
Contribution
The paper presents the first benchmark dataset for assessing LLMs in statistical coding. It comprises statistical analysis tasks, LLM-generated SAS code, and human evaluation scores, and supports both performance evaluation and improvement.
Findings
StatLLM enables systematic evaluation of LLMs in statistical coding.
Human evaluations highlight strengths and weaknesses of LLM-generated code.
The dataset supports development of better NLP metrics for code assessment.
Abstract
The coding capabilities of large language models (LLMs) have opened up new opportunities for automatic statistical analysis in machine learning and data science. However, before their widespread adoption, it is crucial to assess the accuracy of code generated by LLMs. A major challenge in this evaluation lies in the absence of a benchmark dataset for statistical code (e.g., SAS and R). To fill in this gap, this paper introduces StatLLM, an open-source dataset for evaluating the performance of LLMs in statistical analysis. The StatLLM dataset comprises three key components: statistical analysis tasks, LLM-generated SAS code, and human evaluation scores. The first component includes statistical analysis tasks spanning a variety of analyses and datasets, providing problem descriptions, dataset details, and human-verified SAS code. The second component features SAS code generated by ChatGPT…
Taxonomy
Topics: Computational and Text Analysis Methods · Topic Modeling
Methods: LLaMA
