Learning Code Preference via Synthetic Evolution
Jiawei Liu, Thanh Nguyen, Mingyue Shang, Hantian Ding, Xiaopeng Li, Yu Yu, Varun Kumar, Zijian Wang

TL;DR
This paper introduces CodeFavor, a framework for learning code preferences from synthetic data, and presents CodePrefBench, a benchmark for evaluating code preference models across properties like correctness, efficiency, and security.
Contribution
The paper proposes a novel synthetic evolution-based method for training code preference models and introduces a comprehensive benchmark for evaluation.
Findings
CodeFavor improves preference prediction accuracy by up to 28.8%.
CodeFavor models can match the preference accuracy of models 6 to 9× larger while being 34× cheaper.
Human preferences are costly and less effective for non-functional code properties.
Abstract
Large Language Models (LLMs) have recently demonstrated remarkable coding capabilities. However, assessing code generation based on well-formed properties and aligning it with developer preferences remains challenging. In this paper, we explore two key questions under the new challenge of code preference learning: (i) How do we train models to predict meaningful preferences for code? and (ii) How do human and LLM preferences align with verifiable code properties and developer code tastes? To this end, we propose CodeFavor, a framework for training pairwise code preference models from synthetic evolution data, including code commits and code critiques. To evaluate code preferences, we introduce CodePrefBench, a benchmark comprising 1364 rigorously curated code preference tasks to cover three verifiable properties (correctness, efficiency, and security) along with human preference. Our…
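To make the pairwise setting in the abstract concrete, here is a minimal Python sketch of how a preference task and an accuracy score could be represented. Everything in it is an assumption for illustration: the `PreferenceTask` fields, the prompt layout in `build_prompt`, the exact-match scoring in `accuracy`, and the `always_a` baseline are hypothetical and are not taken from CodeFavor or CodePrefBench.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PreferenceTask:
    """One pairwise task: two candidate solutions and a gold label (hypothetical schema)."""
    instruction: str
    code_a: str
    code_b: str
    criterion: str   # e.g. "correctness", "efficiency", "security"
    preferred: str   # gold label, "A" or "B"


def build_prompt(task: PreferenceTask) -> str:
    """Format a task as a pairwise prompt. The layout is an assumption, not the paper's."""
    return (
        f"Instruction:\n{task.instruction}\n\n"
        f"[CODE_A]\n{task.code_a}\n\n"
        f"[CODE_B]\n{task.code_b}\n\n"
        f"Criterion: {task.criterion}\n"
        "Which candidate better satisfies the criterion? Answer A or B."
    )


def accuracy(tasks: List[PreferenceTask], judge: Callable[[str], str]) -> float:
    """Fraction of tasks where the judge's answer equals the gold label.

    Any answer other than the gold label counts as wrong; this simple
    scoring rule is an assumption made for the sketch.
    """
    if not tasks:
        raise ValueError("need at least one task")
    correct = sum(
        1 for t in tasks if judge(build_prompt(t)).strip().upper() == t.preferred
    )
    return correct / len(tasks)


def always_a(prompt: str) -> str:
    """Trivial baseline judge that ignores the prompt and always picks candidate A."""
    return "A"


# Two toy tasks with opposite gold labels.
TASKS = [
    PreferenceTask(
        instruction="Return the sum of a list of integers.",
        code_a="def total(xs):\n    return sum(xs)",
        code_b="def total(xs):\n    return len(xs)",
        criterion="correctness",
        preferred="A",
    ),
    PreferenceTask(
        instruction="Check whether x is in a collection of known ids.",
        code_a="def seen(x, ids):\n    return x in list(ids)",
        code_b="def seen(x, ids):\n    return x in set(ids)",
        criterion="efficiency",
        preferred="B",
    ),
]

if __name__ == "__main__":
    print(accuracy(TASKS, always_a))
```

A real judge would replace `always_a` with a call to a preference model; the callable interface keeps the scoring loop independent of how the answer is produced.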
Peer Reviews
Decision·Submitted to ICLR 2025
This paper presents a significant advancement in the field of code preference learning by introducing the CODEFAVOR framework, which innovatively utilizes synthetic evolution data to train models that predict meaningful code preferences. The novelty lies in its dual focus on aligning human and model preferences with verifiable code properties, addressing a critical gap in existing research. Key contributions include the development of CODEPREFBENCH, a comprehensive benchmark with 1364 curated ta
The problem formulation/setting can be improved in terms of clarity, motivation, and realism. The framework is proposed to serve code assessment purposes, i.e., judging automatically which version of code generated by a model from a prompt is preferred (i.e. more correct/secure/efficient) between a pair of two versions. The questions are (1) in what scenarios would these two versions be available, and (2) how realistic it is that there are such strong and discriminative contrasts between the two
(1) The paper is well written and easy to follow. (2) The paper introduces a benchmark which can potentially be used by future papers. (3) The developed approach is evaluated using multiple LLMs, showing that the developed approach is generally effective. (4) The developed approach has good intuitions.
(1) In Table 1, it seems that the approaches in the rows are either LLMs, or LLMs with the training framework developed in this paper. To validate the effectiveness of the developed training framework, might it be helpful to add some baseline training approaches which also train the LLMs using the same training data used by CODEFAVOR? (2) In Table 1, considering that there is still a gap between the Open-Weight Models and Our Models and Baselines (i.e., LLMs used with CODEFAVOR), might it be he
* This paper contributes two synthetic code preference datasets and CODEPREFBENCH, a collection of 1,364 carefully curated preference tasks, to evaluate code preferences labeled by various approaches.
* This paper comprehensively quantifies, and conducts case studies on, code preferences derived from human developers and LLMs.
* CODEFAVOR models can match the preference accuracy of models that are 6∼9× larger while being 34× cheaper.
- The approach to synthetic data generation lacks originality, as creating datasets from git commits [1,6] and evolving from sampled code [2,3] are common practices in the field.
- The pairwise modeling approach is also not particularly novel; using pairwise prompts, criterion-based prompting, and classification or generation labels [4,5,7] have been previously explored in other studies.
- Additionally, there is concern that synthetic data generation may not fully ensure code correctness, as it h
Taxonomy
Topics: Music and Audio Processing · Evolutionary Algorithms and Applications · Natural Language Processing Techniques
Methods: Sparse Evolutionary Training · ALIGN
