Large Language Models of Code Fail at Completing Code with Potential Bugs
Tuan Dinh, Jinman Zhao, Samson Tan, Renato Negrinho, Leonard Lausen, Sheng Zha, George Karypis

TL;DR
This paper investigates how large language models of code perform when the code context contains potential bugs, revealing significant performance degradation and exploring mitigation strategies.
Contribution
It introduces the buggy-code completion problem, creates two datasets with synthetic and real bugs, and evaluates the impact on Code-LLMs' performance.
Findings
Presence of potential bugs reduces Code-LLMs' accuracy by over 50%.
Performance drops significantly with even a single potential bug.
Post-hoc mitigation methods still leave a substantial performance gap.
Abstract
Large language models of code (Code-LLMs) have recently brought tremendous advances to code completion, a fundamental feature of programming assistance and code intelligence. However, most existing works ignore the possible presence of bugs in the code context for generation, which are inevitable in software development. Therefore, we introduce and study the buggy-code completion problem, inspired by the realistic scenario of real-time code suggestion where the code context contains potential bugs -- anti-patterns that can become bugs in the completed program. To systematically study the task, we introduce two datasets: one with synthetic bugs derived from semantics-altering operator changes (buggy-HumanEval) and one with realistic bugs derived from user submissions to coding problems (buggy-FixEval). We find that the presence of potential bugs significantly degrades the generation…
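To make "semantics-altering operator changes" concrete, here is a minimal sketch of how a potential bug might be injected into a code prefix before it is handed to a completion model. The operator table, the `inject_potential_bug` helper, and the example prefix are illustrative assumptions, not the authors' actual dataset construction procedure.

```python
import re

# Hypothetical operator pairs for illustration; the operator set the
# authors actually use is not stated in this summary.
OPPOSITES = {"+": "-", "-": "+", "<": ">", ">": "<", "==": "!=", "!=": "=="}

# Match a space-delimited binary operator so that tokens such as "->"
# or a plain assignment "=" are left untouched.
OPERATOR = re.compile(r" (==|!=|\+|-|<|>) ")


def inject_potential_bug(prefix: str) -> str:
    """Flip the first matching binary operator in a code prefix."""
    def flip(match: re.Match) -> str:
        return f" {OPPOSITES[match.group(1)]} "

    return OPERATOR.sub(flip, prefix, count=1)


# An illustrative partial program (assumed, in the style of a coding problem).
clean_prefix = (
    "def has_close_elements(numbers, threshold):\n"
    "    for i, a in enumerate(numbers):\n"
    "        for j, b in enumerate(numbers):\n"
    "            if i != j:\n"
)

# The condition `i != j` becomes `i == j`, so a natural continuation of
# this prefix would no longer solve the original task.
buggy_prefix = inject_potential_bug(clean_prefix)
print(buggy_prefix)
```

In this setting the model sees only `buggy_prefix` and must still produce a completion that passes the problem's tests, which means it has to either work around or repair the flipped operator.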
Taxonomy
Topics: Software Engineering Research · Software Testing and Debugging Techniques · Software System Performance and Reliability
