Automatic Program Repair with OpenAI's Codex: Evaluating QuixBugs
Julian Aron Prenner, Romain Robbes

TL;DR
This paper evaluates the ability of OpenAI's Codex model to localize and fix bugs in code, and finds it surprisingly effective at automated program repair (APR) on the QuixBugs benchmark.
Contribution
The paper is the first to assess Codex's bug-fixing capability, showing competitive performance even though the model was not specifically trained for automated program repair (APR).
Findings
Codex is effective at bug localization and fixing.
It is slightly more successful at repairing Python than Java.
It is surprisingly competitive with recent state-of-the-art APR techniques.
Abstract
OpenAI's Codex, a GPT-3-like model trained on a large code corpus, has made headlines in and outside of academia. Given a short user-provided description, it is capable of synthesizing code snippets that are syntactically and semantically valid in most cases. In this work, we want to investigate whether Codex is able to localize and fix bugs, a task of central interest in the field of automated program repair. Our initial evaluation uses the multi-language QuixBugs benchmark (40 bugs in both Python and Java). We find that, despite not being trained for APR, Codex is surprisingly effective, and competitive with recent state-of-the-art techniques. Our results also show that Codex is slightly more successful at repairing Python than Java.
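To make the repair task concrete, the following is a minimal sketch of how a candidate fix can be checked against test cases. It is an illustration only and not the paper's evaluation harness: the function names (`buggy_gcd`, `fixed_gcd`, `passes_tests`), the single-line defect, and the test cases are all assumptions chosen to resemble the small, self-contained programs found in benchmarks such as QuixBugs.

```python
def buggy_gcd(a, b):
    # Illustrative single-line defect (assumed, not taken from the paper):
    # the recursive call passes its arguments in the wrong order.
    if b == 0:
        return a
    return buggy_gcd(a % b, b)


def fixed_gcd(a, b):
    # A candidate repair, as a model might propose it: only one line changes.
    if b == 0:
        return a
    return fixed_gcd(b, a % b)


def passes_tests(candidate, cases):
    """Return True if the candidate produces the expected output on every case."""
    for args, expected in cases:
        try:
            if candidate(*args) != expected:
                return False
        except Exception:
            # Crashes (for example, unbounded recursion) count as failures.
            return False
    return True


# Hypothetical test cases: (arguments, expected result).
CASES = [((35, 21), 7), ((17, 0), 17), ((12, 18), 6)]

if __name__ == "__main__":
    print("buggy passes:", passes_tests(buggy_gcd, CASES))
    print("fixed passes:", passes_tests(fixed_gcd, CASES))
```

In this sketch a patch counts as plausible only if it passes every test case. Passing tests does not by itself guarantee that a patch is correct, so a real evaluation would typically add further checks.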
Taxonomy
Topics: Software Testing and Debugging Techniques · Software Engineering Research · Advanced Malware Detection Techniques
