$\textbf{PLUM}$: Improving Code LMs with Execution-Guided On-Policy Preference Learning Driven By Synthetic Test Cases
Dylan Zhang, Shizhe Diao, Xueyan Zou, Hao Peng

TL;DR
PLUM introduces an execution-guided, on-policy preference learning framework for code language models that leverages synthetic test cases to improve code generation accuracy without requiring reward models.
Contribution
It proposes a novel on-policy preference learning method using synthetic test cases, eliminating the need for reward models and enabling scalable, online preference data collection.
Findings
PLUM improves pass rates by up to 4.8% on standard benchmarks.
PLUM achieves an 11.8% increase on LiveCodeBench.
The approach is effective across various pre-trained code LMs.
Abstract
Preference learning provides a promising solution to address the limitations of supervised fine-tuning (SFT) for code language models, where the model is not explicitly trained to differentiate between correct and incorrect code. Recent findings demonstrate that on-policy data is the key to successful preference learning, where the preference data is collected using the same policy LM being trained. Inspired by this, we propose PLUM, an on-policy Preference Learning framework Augmented with test cases for code LMs. The framework operates in three key stages: (1) automatic generation of test cases from natural language instructions, (2) creation of preference data by evaluating candidate code solutions sampled from the policy, which can then be used to (3) train the policy LM. PLUM alleviates the need to train reward models, allowing for…
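Stage (2) of the abstract, turning executed candidates into preference data, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `passes_tests` and `build_preference_pairs`, the assert-based test format, and the choice to pair every passing candidate with every failing one are all assumptions made for illustration. The candidates here are hard-coded strings standing in for samples from the policy LM.

```python
from itertools import product


def passes_tests(solution_src: str, test_src: str) -> bool:
    """Return True if the candidate defines its code and every assert in test_src holds.

    Hypothetical helper. It uses plain exec for brevity; a real pipeline would
    run untrusted model output in a sandboxed subprocess with a timeout.
    """
    namespace: dict = {}
    try:
        exec(solution_src, namespace)  # define the candidate's functions
        exec(test_src, namespace)      # run the assert-based test cases
    except Exception:
        # Syntax errors, runtime errors, and failed assertions all count as failure.
        return False
    return True


def build_preference_pairs(prompt: str, candidates: list[str], test_src: str) -> list[dict]:
    """Pair each passing candidate (chosen) with each failing one (rejected)."""
    passed, failed = [], []
    for candidate in candidates:
        (passed if passes_tests(candidate, test_src) else failed).append(candidate)
    return [
        {"prompt": prompt, "chosen": good, "rejected": bad}
        for good, bad in product(passed, failed)
    ]


# Illustrative inputs: one instruction, two sampled solutions, synthetic tests.
prompt = "Write a function add(a, b) that returns the sum of a and b."
good_solution = "def add(a, b):\n    return a + b\n"
bad_solution = "def add(a, b):\n    return a - b\n"
tests = "assert add(1, 2) == 3\nassert add(-1, 1) == 0\n"

pairs = build_preference_pairs(prompt, [good_solution, bad_solution], tests)
```

If all candidates pass or all fail, the sketch yields no pairs for that prompt, since there is nothing to contrast. The resulting `prompt`/`chosen`/`rejected` records are the kind of input a preference-learning objective would consume in stage (3).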
Taxonomy
Topics: Natural Language Processing Techniques · Software Testing and Debugging Techniques
Methods: Shrink and Fine-Tune
