Static Analysis as a Feedback Loop: Enhancing LLM-Generated Code Beyond Correctness
Scott Blyth, Sherlock A. Licorish, Christoph Treude, and Markus Wagner

TL;DR
This paper presents a static analysis-driven feedback loop that iteratively improves LLM-generated code across multiple quality dimensions, significantly reducing security, readability, and reliability issues.
Contribution
It introduces an iterative prompting algorithm that feeds findings from static analysis tools (Bandit and Pylint) back to the LLM, improving the quality of generated code beyond functional correctness.
Findings
Security issues reduced from >40% to 13%
Readability violations reduced from >80% to 11%
Reliability warnings reduced from >50% to 11%
Abstract
Large language models (LLMs) have demonstrated impressive capabilities in code generation, achieving high scores on benchmarks such as HumanEval and MBPP. However, these benchmarks primarily assess functional correctness and neglect broader dimensions of code quality, including security, reliability, readability, and maintainability. In this work, we systematically evaluate the ability of LLMs to generate high-quality code across multiple dimensions using the PythonSecurityEval benchmark. We introduce an iterative static analysis-driven prompting algorithm that leverages Bandit and Pylint to identify and resolve code quality issues. Our experiments with GPT-4o show substantial improvements: security issues reduced from >40% to 13%, readability violations from >80% to 11%, and reliability warnings from >50% to 11% within ten iterations. These results demonstrate that LLMs, when guided by…
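The feedback loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `refine`, `build_feedback_prompt`, `toy_generate`, and `toy_analyzer` are hypothetical, the prompt wording is invented, and the LLM and the static analyzers are replaced by stubs so the example runs on its own. In a real setup, the analyzers would presumably wrap Bandit and Pylint, and the generator would call a model such as GPT-4o.

```python
from typing import Callable, List, Tuple

# Hypothetical type aliases, introduced only for this sketch.
Analyzer = Callable[[str], List[str]]   # source code -> list of issue messages
Generator = Callable[[str], str]        # prompt -> source code


def build_feedback_prompt(task: str, code: str, issues: List[str]) -> str:
    """Combine the task, the current code, and the reported issues into one prompt."""
    bullet_list = "\n".join(f"- {issue}" for issue in issues)
    return (
        f"Task:\n{task}\n\n"
        f"Current code:\n{code}\n\n"
        f"Static analysis reported these issues:\n{bullet_list}\n\n"
        "Rewrite the code to resolve them without changing its behaviour."
    )


def refine(
    task: str,
    generate: Generator,
    analyzers: List[Analyzer],
    max_iterations: int = 10,
) -> Tuple[str, List[int]]:
    """Regenerate code until the analyzers report nothing or the budget runs out.

    Returns the last generated code and the issue count seen at each iteration.
    Note: if the budget is exhausted, the final regeneration is not re-analyzed.
    """
    code = generate(task)
    history: List[int] = []
    for _ in range(max_iterations):
        # Collect issues from every analyzer for the current candidate.
        issues = [msg for analyze in analyzers for msg in analyze(code)]
        history.append(len(issues))
        if not issues:
            break
        code = generate(build_feedback_prompt(task, code, issues))
    return code, history


# --- Stubs standing in for the LLM and for a security linter (assumptions) ---

def toy_generate(prompt: str) -> str:
    """Pretend LLM: emits unsafe code first, safe code once it receives feedback."""
    if "Static analysis reported" in prompt:
        return "def parse(s):\n    return int(s)\n"
    return "def parse(s):\n    return eval(s)\n"


def toy_analyzer(code: str) -> List[str]:
    """Pretend security check: flags any use of eval()."""
    return ["use of eval() is insecure"] if "eval(" in code else []


final_code, issue_counts = refine("Parse an integer from a string", toy_generate, [toy_analyzer])
print(issue_counts)
```

Passing the analyzers in as plain callables keeps the loop independent of any particular tool, so security, readability, and reliability checks can be combined or swapped without changing the loop itself.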
Taxonomy
Topics: Mathematics, Computing, and Information Processing · Digital Rights Management and Security · Natural Language Processing Techniques
