Same Prompt, Different Outcomes: Evaluating the Reproducibility of Data Analysis by LLMs
Jiaxin Cui, Rohan Alexander

TL;DR
This study systematically evaluates the reproducibility of data analysis performed by Large Language Models (LLMs). Results vary considerably across runs even under identical configurations, so an LLM-based analysis should be run multiple times and the distribution of results examined.
Contribution
It assesses the reproducibility of LLM data analysis across six models, two prompting strategies, and four temperature settings, documenting run-to-run variability and recommending repeated independent runs.
Findings
Considerable variation in analysis results across runs, even under the same configuration
Reproducibility issues similar to those in human data analysis
Multiple independent executions recommended, with the distribution of results considered
Abstract
We systematically evaluate the reproducibility of data analysis conducted by Large Language Models (LLMs). We evaluate two prompting strategies, six models, and four temperature settings, with ten independent executions per configuration, yielding 480 total attempts. We assess the completion, concordance, validity, and consistency of each attempt and find considerable variation in the analytical results even for consistent configurations. This suggests that, as with human data analysis, data analysis conducted by LLMs can vary even given the same task, data, and settings. Our results mean that if an LLM is being used to conduct data analysis, then it should be run multiple times independently and the distribution of results considered.
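The experimental grid and the closing recommendation can be illustrated with a minimal sketch. It is not the authors' code. The configuration labels (`prompt_0`, `model_0`, `temp_0`, and so on) are hypothetical placeholders, since the abstract gives only the counts, and `summarize` assumes each run yields a single numeric estimate, which the abstract does not state.

```python
from itertools import product
from statistics import mean, stdev

# Hypothetical labels: the abstract reports only the counts (2, 6, 4, 10).
PROMPTS = [f"prompt_{i}" for i in range(2)]
MODELS = [f"model_{i}" for i in range(6)]
TEMPERATURES = [f"temp_{i}" for i in range(4)]
RUNS_PER_CONFIG = 10


def build_attempts():
    """Enumerate every (prompt, model, temperature, run) combination."""
    return [
        (prompt, model, temp, run)
        for prompt, model, temp in product(PROMPTS, MODELS, TEMPERATURES)
        for run in range(RUNS_PER_CONFIG)
    ]


def summarize(estimates):
    """Describe the distribution of numeric results from repeated runs.

    Assumes each run produced one numeric estimate (an illustration only).
    Requires at least two values, because stdev is undefined for one.
    """
    return {
        "n": len(estimates),
        "mean": mean(estimates),
        "sd": stdev(estimates),
        "min": min(estimates),
        "max": max(estimates),
    }


attempts = build_attempts()
print(len(attempts))  # 2 * 6 * 4 * 10 = 480
```

Reporting the spread (`sd`, `min`, `max`) alongside the mean, rather than a single run's output, is one simple way to consider the distribution of results for a configuration.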
Taxonomy
Topics: Topic Modeling · Computational and Text Analysis Methods · Artificial Intelligence in Healthcare and Education
