CodeDPO: Aligning Code Models with Self Generated and Verified Source Code

Kechi Zhang; Ge Li; Yihong Dong; Jingjing Xu; Jun Zhang; Jing Su; Yongfei Liu; Zhi Jin

arXiv:2410.05605·cs.SE·June 4, 2025

Open Access · 3 Reviews

TL;DR

CodeDPO is a preference-learning framework for code generation models. It improves code correctness and efficiency by training on a dataset built through self-generation and self-validation of code and test cases.

Contribution

The paper presents a scalable method for building a self-generated preference dataset, and uses it to train code models toward more correct and more efficient code.

Findings

01 Significant improvements in code correctness and efficiency on five benchmarks.

02 Effective self-validation mechanism using test case consensus.

03 Enhanced model performance in real-world code generation scenarios.

Abstract

Code generation models have shown significant potential for programming tasks. However, existing training methods like supervised fine-tuning face key limitations: they do not effectively teach models to prioritize correct over incorrect solutions in ambiguous situations, nor do they effectively optimize the runtime efficiency of the generated code. To address these challenges, we propose CodeDPO, a framework that integrates preference learning into code generation to improve two key code preference factors: code correctness and efficiency. CodeDPO employs a novel dataset construction method, utilizing a self-generation-and-validation mechanism that simultaneously generates and evaluates code and test cases. The underlying assumption is that test cases executable by multiple code snippets provide more reliable validation, and code that passes more tests is more likely to be correct.…
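The consensus assumption stated in the abstract can be illustrated with a small mutual-reinforcement loop. This is a minimal sketch, not the authors' implementation: the function names `consensus_scores` and `preference_pair`, the damping factor, the iteration count, the uniform initial scores, and the sum normalization are all assumptions made for illustration. Code candidates gain score from the tests they pass, and tests gain score from the candidates that pass them.

```python
from typing import List, Tuple


def consensus_scores(
    passes: List[List[bool]],
    iterations: int = 10,   # assumed value, not taken from the paper
    damping: float = 0.85,  # assumed value, not taken from the paper
) -> Tuple[List[float], List[float]]:
    """Score code candidates and test cases from an execution matrix.

    passes[i][j] is True when code candidate i passes test case j.
    Returns (code_scores, test_scores), each normalized to sum to 1.
    """
    n_code = len(passes)
    n_test = len(passes[0]) if passes else 0
    code = [1.0] * n_code
    test = [1.0] * n_test

    for _ in range(iterations):
        # A candidate is credited with the scores of the tests it passes.
        new_code = [
            (1 - damping) * code[i]
            + damping * sum(test[j] for j in range(n_test) if passes[i][j])
            for i in range(n_code)
        ]
        # A test is credited with the scores of the candidates that pass it.
        new_test = [
            (1 - damping) * test[j]
            + damping * sum(code[i] for i in range(n_code) if passes[i][j])
            for j in range(n_test)
        ]
        # Normalize so the scores stay bounded across iterations.
        code_total = sum(new_code) or 1.0
        test_total = sum(new_test) or 1.0
        code = [s / code_total for s in new_code]
        test = [s / test_total for s in new_test]

    return code, test


def preference_pair(passes: List[List[bool]]) -> Tuple[int, int]:
    """Return (chosen, rejected) candidate indices by consensus score."""
    code, _ = consensus_scores(passes)
    ranked = sorted(range(len(code)), key=lambda i: code[i])
    return ranked[-1], ranked[0]


if __name__ == "__main__":
    # Candidates 0 and 1 agree on tests 0 and 1; candidate 2 passes only test 2.
    matrix = [
        [True, True, False],
        [True, True, False],
        [False, False, True],
    ]
    code_scores, test_scores = consensus_scores(matrix)
    print("code scores:", code_scores)
    print("test scores:", test_scores)
    print("chosen, rejected:", preference_pair(matrix))
```

In this toy matrix, the two candidates that agree on two tests end up with higher scores than the isolated candidate, and the tests they share outrank the test that only one candidate passes. A (chosen, rejected) pair selected this way is the kind of input a preference-learning objective such as DPO consumes.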

Peer Reviews

Decision · Submitted to ICLR 2025

Reviewer 01 · Rating 3 · Confidence 3

Strengths

1. This paper focuses on an important topic, LLM-based code generation.
2. Experimental results show the proposed CodeDPO outperforms existing methods.

Weaknesses

This paper should be rejected for the following reasons:
1. The paper lacks many details, and the explanations of the experiments are insufficiently clear, making it difficult to understand.
2. The paper lacks sufficient novelty and rigor; it does not explain the costs associated with the proposed algorithm.
3. The writing is unrefined and not polished.

Main Argument

First, the paper is based on a variant of DPO and applies it to the code generation domain. However, it does not provide an expla

Reviewer 02 · Rating 5 · Confidence 5

Strengths

1. This work presents an interesting investigation of DPO in the domain of code generation. While there have been some similar investigations like Code-Optimise and PLUM, this work involves a couple of new ideas, including the PageRank-inspired algorithm and the integration of code efficiency into the optimization objective.
2. The authors compared CodeDPO with two similar approaches, Code-Optimise and PLUM, on multiple LLMs and code generation datasets. The results look promising.
3. The aut

Weaknesses

1. There is a potential fairness issue in the comparison with baseline methods. The finetuning dataset generated by CodeDPO includes 114K training samples in total. There is no description of the finetuning dataset sizes used in other baseline methods. For instance, if the authors used the original finetuning dataset generated by OSS-Instruct in Table 1, this would be unfair to OSS-Instruct since its dataset only includes 75K training samples.
2. This paper only describes the final dataset gen

Reviewer 03 · Rating 3 · Confidence 4

Strengths

+ The paper trains a code generation model with DPO with several pre-trained checkpoints, illustrating the effectiveness of functional-correctness-driven DPO for code generation.
+ The paper proposes a handful of research questions to evaluate CodeDPO on varied benchmarks and compare with multiple baselines

Weaknesses

Few Novel Ideas Proposed and Few New Insights Concluded.

The idea of trying DPO for code generation seems to be an intuitive combination for post-SFT phases of training, and such a combination has been tried in varied domains including code. For example, in the latest technical report of Llama-3.1, DPO has become the main algorithm for preference optimization, where they have empirically illustrated DPO's effectiveness in coding, math, natural language understanding, etc. Besides, Llama-3.1

Taxonomy

Topics: Software Engineering Research · Software Testing and Debugging Techniques · Service-Oriented Architecture and Web Services