Operational Robustness of LLMs on Code Generation
Debalina Ghosh Paul, Hong Zhu, Ian Bayley

TL;DR
This paper introduces a new method called scenario domain analysis to evaluate the robustness of large language models in code generation, focusing on their sensitivity to changes in task descriptions.
Contribution
It proposes a formal, theoretically grounded robustness evaluation technique tailored for discrete natural language inputs in code generation tasks.
Findings
LLMs show decreased robustness with more complex tasks.
Robustness varies across different coding scenarios and topics.
The method effectively ranks LLMs by robustness.
Abstract
It is now common practice in software development for large language models (LLMs) to be used to generate program code. It is desirable to evaluate the robustness of LLMs for this usage. This paper is concerned in particular with how sensitive LLMs are to variations in descriptions of the coding tasks. However, existing techniques for evaluating this robustness are unsuitable for code generation because the input data space of natural language descriptions is discrete. To address this problem, we propose a robustness evaluation method called scenario domain analysis, which aims to find the expected minimal change in the natural language descriptions of coding tasks that would cause the LLMs to produce incorrect outputs. We have formally proved the theoretical properties of the method and also conducted extensive experiments to evaluate the robustness of four state-of-the-art LLMs:…
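The idea of a "minimal change that causes an incorrect output" can be illustrated with a small sketch. This is not the paper's scenario domain analysis, whose details are not given in the abstract. It assumes, purely for illustration, that variants of each task description are already grouped by their edit distance from the original, and that a hypothetical `passes` callback stands in for "generate code with the LLM and run the tests". The names `minimal_failing_distance`, `mean_minimal_distance`, `toy_passes`, and the `cap` parameter are all invented here.

```python
from typing import Callable, Optional, Sequence


def minimal_failing_distance(
    variants_by_distance: Sequence[Sequence[str]],
    passes: Callable[[str], bool],
) -> Optional[int]:
    """Return the smallest distance d (1-based) at which some variant fails.

    variants_by_distance[d - 1] holds task descriptions assumed to be d edits
    away from the original description. Returns None if no variant fails.
    """
    for d, variants in enumerate(variants_by_distance, start=1):
        if any(not passes(v) for v in variants):
            return d
    return None


def mean_minimal_distance(distances: Sequence[Optional[int]], cap: int) -> float:
    """Average the minimal failing distances, counting 'never failed' as cap."""
    if not distances:
        raise ValueError("no tasks given")
    return sum(cap if d is None else d for d in distances) / len(distances)


def toy_passes(description: str) -> bool:
    # Hypothetical stand-in for "the LLM's generated code passes the tests":
    # this toy model only succeeds when the description mentions sorting.
    return "sort" in description.lower()


# Two toy tasks, each with variants at distance 1 and distance 2.
task_a = [["Sort the list ascending"], ["Order the list ascending"]]
task_b = [["Sort numbers"], ["Sort the numbers please"]]

d_a = minimal_failing_distance(task_a, toy_passes)
d_b = minimal_failing_distance(task_b, toy_passes)
score = mean_minimal_distance([d_a, d_b], cap=3)
print(d_a, d_b, score)
```

Under these assumptions a larger average distance suggests a more robust model, since descriptions must drift further from the original before the output becomes incorrect. Treating tasks that never fail as `cap` is a simplification chosen for this sketch.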
Taxonomy
Topics: Model-Driven Software Engineering Techniques · Software Engineering Research · Topic Modeling
