Prompt Optimization for LLM Code Generation via Reinforcement Learning
Ali Mohammadi Esfahani, Nafiseh Kahani, Samuel A. Ajila

TL;DR
This paper introduces a reinforcement learning framework that optimizes prompts for frozen LLM code generators, improving the functional correctness of the generated code across multiple benchmarks.
Contribution
The authors develop a novel RL-based prompt refinement method using a hybrid action space and shaped rewards, outperforming existing approaches in code generation tasks.
Findings
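PPO-based prompt optimization reaches strict Pass@1 scores of 57.58%, 64.80%, and 85.50% on the 500-task MBPP+ test set with CodeT5+, CodeLLaMA, and DeepSeek-Coder, respectively.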
The method outperforms EPiC, Reflexion, and Random-Hybrid baselines.
Functional correctness in code generation is enhanced through test-driven shaped rewards.
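To make the idea of a test-driven shaped reward concrete, here is a minimal sketch. The exact reward used in the paper is not given in this summary, so the shaping below (fraction of unit tests passed, plus a bonus when every test passes) is an assumption, and the names `run_tests`, `shaped_reward`, and `full_pass_bonus` are hypothetical.

```python
from typing import List, Tuple


def run_tests(code: str, func_name: str,
              tests: List[Tuple[tuple, object]]) -> List[bool]:
    """Execute `code`, then check func_name(*args) == expected per test."""
    namespace: dict = {}
    try:
        exec(code, namespace)  # unsandboxed: for illustration only
    except Exception:
        return [False] * len(tests)  # code that does not load fails every test
    func = namespace.get(func_name)
    if not callable(func):
        return [False] * len(tests)
    results = []
    for args, expected in tests:
        try:
            results.append(func(*args) == expected)
        except Exception:
            results.append(False)  # a runtime error counts as a failed test
    return results


def shaped_reward(results: List[bool], full_pass_bonus: float = 1.0) -> float:
    """Assumed shaping: partial credit per test, bonus for a full pass."""
    if not results:
        return 0.0
    fraction = sum(results) / len(results)
    bonus = full_pass_bonus if all(results) else 0.0
    return fraction + bonus


tests = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]
good = "def add(a, b):\n    return a + b"
buggy = "def add(a, b):\n    return a + b if a else 1"
broken = "def add(a, b) return a + b"

print(shaped_reward(run_tests(good, "add", tests)))             # 2.0
print(round(shaped_reward(run_tests(buggy, "add", tests)), 3))  # 0.667
print(shaped_reward(run_tests(broken, "add", tests)))           # 0.0
```

The partial credit gives the agent a learning signal even when no candidate passes every test, which a binary pass/fail reward would not provide.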
Abstract
Large Language Models (LLMs) can generate code from natural language, but their performance is highly sensitive to prompt formulation. We propose a reinforcement-learning-based framework that models prompt refinement as a sequential decision-making problem. A Proximal Policy Optimization (PPO) agent iteratively improves prompts using a hybrid action space that combines direct generation, genetic lexical mutation and semantic rewriting, guided by shaped rewards derived from unit-test feedback. We evaluate the framework on MBPP+, HumanEval+, and APPS using CodeT5+, CodeLLaMA, and DeepSeek-Coder as frozen code generators. On the 500-task MBPP+ test set, the PPO agent achieves strict Pass@1 scores of 57.58%, 64.80%, and 85.50%, respectively, outperforming EPiC, Reflexion, and Random-Hybrid. Soft-Pass@1 reaches 67.90%, 73.10%, and 88.20%, respectively. Similar improvements are observed on…
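The sequential decision-making view in the abstract can be sketched as an episode loop: the state is the current prompt, an action picks one of the three operator families, and the reward scores the result. Everything below is a toy stand-in and not the authors' implementation. The operator bodies, `toy_reward`, and `run_episode` are hypothetical, and the policy is a plain callable where the paper trains a PPO agent.

```python
import random
from typing import Callable, List, Tuple

HINT = "Return the result and handle edge cases."


def direct_generation(prompt: str, rng: random.Random) -> str:
    # Placeholder: leaves the prompt unchanged. The real operator is not
    # specified in this summary.
    return prompt


def lexical_mutation(prompt: str, rng: random.Random) -> str:
    # Toy mutation: swap two adjacent words.
    words = prompt.split()
    if len(words) > 1:
        i = rng.randrange(len(words) - 1)
        words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)


def semantic_rewrite(prompt: str, rng: random.Random) -> str:
    # Toy rewrite: append a fixed clarifying sentence once.
    return prompt if HINT in prompt else f"{prompt} {HINT}"


ACTIONS = {
    "generate": direct_generation,
    "mutate": lexical_mutation,
    "rewrite": semantic_rewrite,
}


def toy_reward(prompt: str) -> float:
    # Stand-in for "query a frozen code generator, then run unit tests":
    # the fraction of two keywords present in the prompt.
    keywords = ("return", "edge cases")
    return sum(k in prompt.lower() for k in keywords) / len(keywords)


def run_episode(prompt: str,
                policy: Callable[[str, random.Random], str],
                reward_fn: Callable[[str], float],
                max_steps: int = 5,
                seed: int = 0) -> Tuple[str, float, List[Tuple[str, float]]]:
    rng = random.Random(seed)
    best_prompt, best_reward = prompt, reward_fn(prompt)
    trajectory: List[Tuple[str, float]] = []
    for _ in range(max_steps):
        action = policy(prompt, rng)
        prompt = ACTIONS[action](prompt, rng)
        reward = reward_fn(prompt)
        trajectory.append((action, reward))
        if reward > best_reward:
            best_prompt, best_reward = prompt, reward
        if reward >= 1.0:  # stop once the toy reward is maximal
            break
    return best_prompt, best_reward, trajectory


def random_policy(prompt: str, rng: random.Random) -> str:
    # Uniform choice over action families; a learned policy would go here.
    return rng.choice(list(ACTIONS))


start = "Write a function that returns a sorted list."
print(run_episode(start, random_policy, toy_reward))
```

In a trained system, the `(action, reward)` pairs collected in `trajectory` are the kind of data a policy-gradient method such as PPO would use to update the policy; that update step is omitted here.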
