An Empirical Study of the Non-determinism of ChatGPT in Code Generation

Shuyin Ouyang; Jie M. Zhang; Mark Harman; Meng Wang

arXiv:2308.02828 · cs.SE · October 18, 2024 · 49 citations

TL;DR

This study empirically demonstrates that ChatGPT's code generation is highly non-deterministic across multiple benchmarks, posing challenges to scientific validity and emphasizing the need for researchers to account for this variability.

Contribution

It provides the first comprehensive empirical analysis of ChatGPT's non-determinism in code generation across three benchmarks, highlighting its impact on research reliability.

Findings

01

In CodeContests, 75.76% of tasks had zero identical test outputs across repeated requests

02

Setting temperature to 0 reduces non-determinism but does not eliminate it

03

High non-determinism threatens the validity of scientific conclusions in LLM research
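The first finding can be made concrete with a small sketch of how such a ratio might be computed. This is an illustration under stated assumptions, not the authors' implementation: it assumes a task counts as having "zero identical test outputs" when no two requests for that task produced the same sequence of test outputs. The function names and the sample data are hypothetical.

```python
from itertools import combinations


def has_zero_identical_outputs(outputs_per_request):
    """Return True if no two requests produced the same test outputs.

    outputs_per_request: one tuple of test outputs per request
    (assumed representation, not taken from the paper).
    """
    return all(a != b for a, b in combinations(outputs_per_request, 2))


def ratio_zero_identical(tasks):
    """Fraction of tasks whose requests all produced distinct test outputs."""
    flagged = sum(has_zero_identical_outputs(outs) for outs in tasks.values())
    return flagged / len(tasks)


# Hypothetical data: each task maps to the test outputs of three requests.
tasks = {
    "task_a": [("1", "2"), ("1", "3"), ("0", "2")],  # all requests differ
    "task_b": [("ok",), ("ok",), ("fail",)],         # two requests agree
    "task_c": [("x",), ("y",), ("z",)],              # all requests differ
    "task_d": [("5",), ("5",), ("5",)],              # all requests agree
}

ratio = ratio_zero_identical(tasks)
print(f"{ratio:.2%} of tasks have zero identical test outputs")
```

Comparing test outputs rather than source text means two syntactically different programs that behave the same on the tests are counted as identical, which is one reasonable way to separate surface variation from behavioural variation.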

Abstract

There has been a recent explosion of research on Large Language Models (LLMs) for software engineering tasks, in particular code generation. However, results from LLMs can be highly unstable, non-deterministically returning very different code for the same prompt. Non-determinism is a potential menace to scientific conclusion validity. When non-determinism is high, scientific conclusions simply cannot be relied upon unless researchers change their behaviour to control for it in their empirical analyses. This paper conducts an empirical study to demonstrate that non-determinism is, indeed, high, thereby underlining the need for this behavioural change. We choose to study ChatGPT because it is already highly prevalent in the code generation research literature. We report results from a study of 829 code generation problems from three code generation benchmarks (i.e., CodeContests, APPS,…

Taxonomy

Topics: Software Engineering Research · Software System Performance and Reliability · Software Engineering Techniques and Practices