Helping LLMs Improve Code Generation Using Feedback from Testing and Static Analysis
Greta Dolcetti, Vincenzo Arceri, Eleonora Iotti, Sergio Maffeis, Agostino Cortesi, Enea Zaffanella

TL;DR
This paper introduces a framework that uses testing and static analysis to evaluate and improve code generated by open-source LLMs, highlighting their strengths and weaknesses in correctness and safety.
Contribution
The study demonstrates how feedback from testing and static analysis can guide LLMs to better evaluate and fix their generated code, advancing safe AI-assisted programming.
Findings
The models often produce incorrect and unsafe code.
The models perform poorly at detecting errors and vulnerabilities in generated code.
The models show a strong ability to fix code when given feedback from testing and static analysis.
Abstract
Large Language Models (LLMs) are one of the most promising developments in the field of artificial intelligence, and the software engineering community has readily noticed their potential role in the software development life-cycle. Developers routinely ask LLMs to generate code snippets, increasing productivity but also potentially introducing ownership, privacy, correctness, and security issues. Previous work highlighted how code generated by mainstream commercial LLMs is often not safe, containing vulnerabilities, bugs, and code smells. In this paper, we present a framework that leverages testing and static analysis to assess the quality, and guide the self-improvement, of code generated by general-purpose, open-source LLMs. First, we ask LLMs to generate C code to solve a number of programming tasks. Then we employ ground-truth tests to assess the (in)correctness of the generated…
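The generate, check, and repair cycle described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation. The names `Report`, `build_feedback`, and `repair_loop`, the `max_rounds` limit, and the three callables (`generate`, `run_tests`, `run_analyzer`) are all hypothetical stand-ins for an LLM call, a ground-truth test harness, and a static analyzer, none of which are specified in the text above.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Report:
    """Findings for one candidate program (hypothetical structure)."""
    test_failures: List[str]
    analyzer_warnings: List[str]

    @property
    def clean(self) -> bool:
        # A candidate counts as clean only if both checks found nothing.
        return not self.test_failures and not self.analyzer_warnings


def build_feedback(report: Report) -> str:
    """Turn test and static-analysis findings into a plain-text prompt fragment."""
    lines = [f"Failed test: {name}" for name in report.test_failures]
    lines += [f"Static analysis warning: {msg}" for msg in report.analyzer_warnings]
    return "\n".join(lines)


def repair_loop(
    task: str,
    generate: Callable[[str, Optional[str], str], str],  # assumed LLM wrapper
    run_tests: Callable[[str], List[str]],               # assumed test harness
    run_analyzer: Callable[[str], List[str]],            # assumed static analyzer
    max_rounds: int = 3,                                 # assumed repair budget
) -> Tuple[str, bool]:
    """Generate code, then ask the model to repair it while findings remain."""
    code = generate(task, None, "")
    for _ in range(max_rounds):
        report = Report(run_tests(code), run_analyzer(code))
        if report.clean:
            return code, True
        # Feed the previous attempt and the findings back to the model.
        code = generate(task, code, build_feedback(report))
    # Check the last repair attempt before giving up.
    final = Report(run_tests(code), run_analyzer(code))
    return code, final.clean
```

In this sketch the model receives both its previous attempt and the combined findings on every repair round, so correctness feedback (failed tests) and safety feedback (analyzer warnings) reach it through the same channel. Whether the paper combines or separates these two feedback sources is not stated in the truncated abstract.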
