$\textbf{PLUM}$: Improving Code LMs with Execution-Guided On-Policy Preference Learning Driven By Synthetic Test Cases

Dylan Zhang; Shizhe Diao; Xueyan Zou; Hao Peng

arXiv:2406.06887·cs.CL·October 15, 2024

Open Access

TL;DR

PLUM introduces an execution-guided, on-policy preference learning framework for code language models that leverages synthetic test cases to improve code generation accuracy without requiring reward models.

Contribution

It proposes a novel on-policy preference learning method using synthetic test cases, eliminating the need for reward models and enabling scalable, online preference data collection.

Findings

01 PLUM improves pass rates by up to 4.8% on standard benchmarks.

02 PLUM achieves an 11.8% increase on LiveCodeBench.

03 The approach is effective across various pre-trained code LMs.

Abstract

Preference learning provides a promising solution to address the limitations of supervised fine-tuning (SFT) for code language models, where the model is not explicitly trained to differentiate between correct and incorrect code. Recent findings demonstrate that on-policy data is the key to successful preference learning, where the preference data is collected using the same policy LM being trained. Inspired by this, we propose PLUM, an on-policy Preference Learning framework Augmented with test cases for code LMs. The framework operates in three key stages: (1) automatic generation of test cases from natural language instructions, (2) creation of preference data by evaluating candidate code solutions sampled from the policy, which can then be used to (3) train the policy LM. PLUM alleviates the need to train reward models, allowing for…
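Stage (2) of the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `passes_tests` and `build_preference_pairs`, the record layout (`prompt`, `chosen`, `rejected`), and the choice to pair every passing candidate with every failing one are all assumptions made for illustration. The candidates and test cases are hard-coded here; in the framework described above they would come from the policy LM and from automatically generated tests. The sketch also runs code with a bare `exec`, whereas a real pipeline would need sandboxing and timeouts.

```python
from itertools import product


def passes_tests(solution_src: str, test_src: str) -> bool:
    """Return True if the candidate runs and every assertion in test_src holds.

    Illustrative only: exec() is used without any sandbox or timeout.
    """
    namespace: dict = {}
    try:
        exec(solution_src, namespace)  # define the candidate's functions
        exec(test_src, namespace)      # run the assertions against them
    except Exception:
        # Syntax errors, runtime errors, and failed assertions all count as failure.
        return False
    return True


def build_preference_pairs(instruction: str, candidates: list[str], test_src: str) -> list[dict]:
    """Pair each passing candidate (chosen) with each failing one (rejected).

    The all-pairs strategy is an assumption, not taken from the paper.
    """
    results = [(c, passes_tests(c, test_src)) for c in candidates]
    passed = [c for c, ok in results if ok]
    failed = [c for c, ok in results if not ok]
    return [
        {"prompt": instruction, "chosen": good, "rejected": bad}
        for good, bad in product(passed, failed)
    ]


# Hypothetical example: three sampled candidates for one instruction.
instruction = "Write a function add(a, b) that returns the sum of a and b."
candidates = [
    "def add(a, b):\n    return a + b\n",   # correct
    "def add(a, b):\n    return a - b\n",   # wrong result
    "def add(a, b)\n    return a + b\n",    # syntax error
]
test_src = "assert add(1, 2) == 3\nassert add(-1, 1) == 0\n"

pairs = build_preference_pairs(instruction, candidates, test_src)
```

Records in this shape could then be passed to a preference-learning objective in stage (3). If every candidate passes or every candidate fails, the function returns an empty list, so that instruction contributes no training signal.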

Taxonomy

Topics: Natural Language Processing Techniques · Software Testing and Debugging Techniques

Methods: Shrink and Fine-Tune