Learning Code Preference via Synthetic Evolution

Jiawei Liu; Thanh Nguyen; Mingyue Shang; Hantian Ding; Xiaopeng Li; Yu Yu; Varun Kumar; Zijian Wang

arXiv:2410.03837 · cs.LG · October 25, 2024

TL;DR

This paper introduces CodeFavor, a framework for learning code preferences from synthetic data, and presents CodePrefBench, a benchmark for evaluating code preference models across properties like correctness, efficiency, and security.

Contribution

The paper proposes a novel synthetic evolution-based method for training code preference models and introduces a comprehensive benchmark for evaluation.

Findings

01. CodeFavor improves preference prediction accuracy by up to 28.8%.

02. CodeFavor models can match the preference accuracy of models 6 to 9× larger while being 34× cheaper.

03. Human preferences are costly and less effective for non-functional code properties.

Abstract

Large Language Models (LLMs) have recently demonstrated remarkable coding capabilities. However, assessing code generation based on well-formed properties and aligning it with developer preferences remains challenging. In this paper, we explore two key questions under the new challenge of code preference learning: (i) How do we train models to predict meaningful preferences for code? and (ii) How do human and LLM preferences align with verifiable code properties and developer code tastes? To this end, we propose CodeFavor, a framework for training pairwise code preference models from synthetic evolution data, including code commits and code critiques. To evaluate code preferences, we introduce CodePrefBench, a benchmark comprising 1364 rigorously curated code preference tasks to cover three verifiable properties (correctness, efficiency, and security) along with human preference. Our…
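
The pairwise setting described in the abstract can be illustrated with a minimal sketch: a task holds an instruction, two candidate solutions, a criterion, and a gold label, and a judge is scored by how often it picks the labeled candidate. Everything here is an assumption made for illustration: the names `PreferenceTask`, `build_prompt`, and `preference_accuracy`, the prompt wording, and the exact-match scoring are hypothetical and are not taken from CodeFavor or CodePrefBench.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class PreferenceTask:
    """One pairwise task (hypothetical schema, not the benchmark's format)."""
    instruction: str
    code_a: str
    code_b: str
    criterion: str  # e.g. "correctness", "efficiency", or "security"
    preferred: str  # gold label: "A" or "B"


def build_prompt(task: PreferenceTask) -> str:
    """Format a criterion-based pairwise prompt (wording is an assumption)."""
    return (
        f"Instruction:\n{task.instruction}\n\n"
        f"[CODE_A]\n{task.code_a}\n\n"
        f"[CODE_B]\n{task.code_b}\n\n"
        f"Which candidate is better with respect to {task.criterion}? "
        "Answer with A or B."
    )


def preference_accuracy(
    tasks: Sequence[PreferenceTask],
    judge: Callable[[str], str],
) -> float:
    """Fraction of tasks where the judge's answer equals the gold label."""
    if not tasks:
        raise ValueError("no tasks to score")
    correct = sum(
        judge(build_prompt(task)).strip().upper() == task.preferred
        for task in tasks
    )
    return correct / len(tasks)
```

Here `judge` is any callable that maps a prompt string to "A" or "B", so an LLM call, a trained classifier, or a trivial baseline can be plugged in without changing the scoring loop.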

Peer Reviews

Decision · Submitted to ICLR 2025

Reviewer 01 · Rating 5 · Confidence 4

Strengths

This paper presents a significant advancement in the field of code preference learning by introducing the CODEFAVOR framework, which innovatively utilizes synthetic evolution data to train models that predict meaningful code preferences. The novelty lies in its dual focus on aligning human and model preferences with verifiable code properties, addressing a critical gap in existing research. Key contributions include the development of CODEPREFBENCH, a comprehensive benchmark with 1364 curated ta

Weaknesses

The problem formulation/setting can be improved in terms of clarity, motivation, and realism. The framework is proposed to serve code assessment purposes, i.e., judging automatically which version of code generated by a model from a prompt is preferred (i.e. more correct/secure/efficient) between a pair of two versions. The questions are (1) in what scenarios would these two versions be available, and (2) how realistic it is that there are such strong and discriminative contrasts between the two

Reviewer 02 · Rating 6 · Confidence 3

Strengths

(1) The paper is well written and easy to follow. (2) The paper introduces a benchmark which can potentially be used by future papers. (3) The developed approach is evaluated using multiple LLMs, showing that the developed approach is generally effective. (4) The developed approach has good intuitions.

Weaknesses

(1) In Table 1, it seems that the approaches in the rows are either LLMs, or LLMs with the training framework developed in this paper. To validate the effectiveness of the developed training framework, might it be helpful to add some baseline training approaches which also train the LLMs using the same training data used by CODEFAVOR? (2) In Table 1, considering that there is still a gap between the Open-Weight Models and Our Models and Baselines (i.e., LLMs used with CODEFAVOR), might it be he

Reviewer 03 · Rating 5 · Confidence 4

Strengths

* This paper contributes two synthetic code preference datasets and CODEPREFBENCH, a collection of 1,364 carefully curated preference tasks, to evaluate code preferences labeled by various approaches.
* This paper comprehensively quantifies, and conducts case studies on, code preferences derived from human developers and LLMs.
* CODEFAVOR models can match the preference accuracy of models that are 6∼9× larger, while being 34× cheaper.

Weaknesses

- The approach to synthetic data generation lacks originality, as creating datasets from git commits [1,6] and evolving from sampled code [2,3] are common practices in the field.
- The pairwise modeling approach is also not particularly novel; pairwise prompts, criterion-based prompting, and classification or generation labels [4,5,7] have been explored in other studies.
- Additionally, there is concern that synthetic data generation may not fully ensure code correctness, as it h

Code & Models

Repositories

amazon-science/llm-code-preference
PyTorch · Official

Datasets

amazon/CodePrefBench
dataset · 41 dl

Taxonomy

Topics: Music and Audio Processing · Evolutionary Algorithms and Applications · Natural Language Processing Techniques

Methods: Sparse Evolutionary Training · ALIGN