An Empirical Study of the Non-determinism of ChatGPT in Code Generation
Shuyin Ouyang, Jie M. Zhang, Mark Harman, Meng Wang

TL;DR
This study empirically demonstrates that ChatGPT's code generation is highly non-deterministic across multiple benchmarks, posing challenges to scientific validity and emphasizing the need for researchers to account for this variability.
Contribution
It provides the first comprehensive empirical analysis of ChatGPT's non-determinism in code generation across three benchmarks, highlighting its impact on research reliability.
Findings
In CodeContests, 75.76% of tasks produced zero identical test outputs across repeated requests
Setting temperature to 0 reduces non-determinism but does not eliminate it
High non-determinism threatens the validity of scientific conclusions in LLM research
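
To make the first finding concrete, here is a minimal sketch of how a "zero identical test outputs" ratio could be computed. It is not the authors' implementation. It assumes a task counts as "zero identical" when no two requests for that task yield the same sequence of test outputs. The names `has_identical_pair`, `zero_identical_ratio`, and the toy data are hypothetical.

```python
from itertools import combinations


def has_identical_pair(outputs):
    """Return True if at least two requests produced the same test outputs."""
    return any(a == b for a, b in combinations(outputs, 2))


def zero_identical_ratio(results):
    """Fraction of tasks where no two requests share identical test outputs.

    `results` maps a task id to a list with one entry per request; each
    entry is a tuple of the outputs the generated code produced on the tests.
    This pairwise definition is an assumption, not taken from the paper.
    """
    if not results:
        return 0.0
    zero = sum(1 for outs in results.values() if not has_identical_pair(outs))
    return zero / len(results)


# Toy data (made up): three requests per task.
toy = {
    "task_a": [("1", "2"), ("1", "2"), ("1", "3")],  # two requests agree
    "task_b": [("x",), ("y",), ("z",)],              # no two requests agree
}
ratio = zero_identical_ratio(toy)
print(ratio)  # 0.5
```

Under this reading, a figure such as 75.76% would mean that for roughly three out of four tasks, no two repeated requests behaved identically on the tests.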
Abstract
There has been a recent explosion of research on Large Language Models (LLMs) for software engineering tasks, in particular code generation. However, results from LLMs can be highly unstable, nondeterministically returning very different code for the same prompt. Non-determinism is a potential menace to scientific conclusion validity. When non-determinism is high, scientific conclusions simply cannot be relied upon unless researchers change their behaviour to control for it in their empirical analyses. This paper conducts an empirical study to demonstrate that non-determinism is, indeed, high, thereby underlining the need for this behavioural change. We choose to study ChatGPT because it is already highly prevalent in the code generation research literature. We report results from a study of 829 code generation problems from three code generation benchmarks (i.e., CodeContests, APPS,…
Taxonomy
Topics: Software Engineering Research · Software System Performance and Reliability · Software Engineering Techniques and Practices
