Same Prompt, Different Outcomes: Evaluating the Reproducibility of Data Analysis by LLMs

Jiaxin Cui; Rohan Alexander

arXiv:2602.14349 · stat.AP · February 17, 2026

Open Access

TL;DR

This study systematically evaluates the reproducibility of data analysis performed by Large Language Models. It finds considerable variation in results even under identical configurations, and recommends running an analysis multiple times and examining the distribution of results.

Contribution

The paper assesses the reproducibility of LLM-conducted data analysis across six models, two prompting strategies, and four temperature settings, scoring each attempt on completion, concordance, validity, and consistency.

Findings

01

Significant variation in analysis results across runs

02

Reproducibility issues similar to human data analysis

03

Multiple executions recommended for reliable results

Abstract

We systematically evaluate the reproducibility of data analysis conducted by Large Language Models (LLMs). We evaluate two prompting strategies, six models, and four temperature settings, with ten independent executions per configuration, yielding 480 total attempts. We assess the completion, concordance, validity, and consistency of each attempt and find considerable variation in the analytical results even for consistent configurations. This suggests that, as with human data analysis, the data analysis conducted by LLMs can vary, even given the same task, data, and settings. Our results mean that if an LLM is being used to conduct data analysis, then it should be run multiple times independently and the distribution of results considered.
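The design arithmetic in the abstract (2 prompting strategies × 6 models × 4 temperature settings × 10 runs = 480 attempts) can be checked with a short sketch. This is an illustration only, not the authors' code: the abstract gives counts but not names or values, so the prompt, model, and temperature labels below are placeholders, and `build_attempts` and `summarize` are hypothetical helpers.

```python
from itertools import product
from statistics import mean, stdev

# Placeholder labels (assumptions): the abstract states only how many of each
# factor were used, not which prompts, models, or temperature values.
PROMPTS = ["prompt_a", "prompt_b"]                       # two prompting strategies
MODELS = [f"model_{i}" for i in range(1, 7)]             # six models
TEMPERATURES = ["temp_1", "temp_2", "temp_3", "temp_4"]  # four temperature settings
RUNS_PER_CONFIG = 10                                     # ten independent executions


def build_attempts():
    """Enumerate every (prompt, model, temperature, run) attempt in the grid."""
    return [
        (prompt, model, temperature, run)
        for prompt, model, temperature in product(PROMPTS, MODELS, TEMPERATURES)
        for run in range(RUNS_PER_CONFIG)
    ]


def summarize(estimates):
    """Describe the spread of one configuration's numeric results across runs."""
    return {
        "n": len(estimates),
        "mean": mean(estimates),
        "sd": stdev(estimates),  # sample standard deviation; needs at least 2 runs
        "min": min(estimates),
        "max": max(estimates),
    }


attempts = build_attempts()
print(len(attempts))  # 2 * 6 * 4 * 10 = 480
```

In this sketch, `summarize` would be applied to the numeric estimates returned by the repeated runs of a single configuration, so that the spread of results is reported rather than one run's answer, in line with the abstract's recommendation.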

Taxonomy

Topics: Topic Modeling · Computational and Text Analysis Methods · Artificial Intelligence in Healthcare and Education