CodeLutra: Boosting LLM Code Generation via Preference-Guided Refinement

Leitian Tao; Xiang Chen; Tong Yu; Tung Mai; Ryan Rossi; Yixuan Li; Saayan Mitra

arXiv:2411.05199·cs.CL·June 27, 2025

Open Access · 5 Reviews

TL;DR

CodeLutra enhances small open-source LLMs for code generation by using preference-guided refinement on both successful and failed outputs, significantly improving accuracy without large datasets.

Contribution

Introduces a novel iterative refinement framework that leverages both correct and incorrect code attempts to improve small LLMs' performance.

Findings

01

Improved Llama-3-8B accuracy from 28.2% to 48.6%.

02

Approached GPT-4's performance on a data science task.

03

Achieved high-quality code generation without massive datasets.

Abstract

Large Language Models (LLMs) have revolutionized code generation but require significant resources and often over-generalize, limiting their task-specific efficiency. Fine-tuning smaller, open-source LLMs provides a cost-effective alternative. However, standard supervised approaches rely only on correct examples, missing valuable insights from failures. We introduce CodeLutra, a framework that leverages both correct and incorrect code attempts. Instead of using only correct solutions, CodeLutra applies iterative preference-based refinement, comparing successful and failed outputs to better approximate desired results. This approach narrows the performance gap with state-of-the-art larger models without requiring massive datasets or auxiliary models. For instance, on a challenging data science coding task, using only 500 samples improved Llama-3-8B's accuracy from 28.2% to 48.6%,…
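The comparison of successful and failed outputs described in the abstract can be sketched as a pairing step. The snippet below is a minimal illustration, not the authors' implementation: `PreferencePair`, `build_pairs`, the `is_correct` callback, and the `max_pairs` cap are hypothetical names, and the correctness signal is assumed to come from executing each candidate and comparing against a reference result.

```python
from itertools import product
from typing import Callable, List, NamedTuple


class PreferencePair(NamedTuple):
    prompt: str
    chosen: str    # a candidate judged correct
    rejected: str  # a candidate judged incorrect


def build_pairs(
    prompt: str,
    candidates: List[str],
    is_correct: Callable[[str], bool],  # assumed: execution-based check
    max_pairs: int = 4,                 # assumed cap, for illustration only
) -> List[PreferencePair]:
    """Pair every passing candidate with every failing one, up to max_pairs."""
    passed = [c for c in candidates if is_correct(c)]
    failed = [c for c in candidates if not is_correct(c)]
    pairs = [PreferencePair(prompt, p, f) for p, f in product(passed, failed)]
    return pairs[:max_pairs]
```

A prompt whose candidates all pass or all fail yields no pairs under this sketch, so it contributes nothing to the preference data in that round. For example, `build_pairs("q", ["a", "b", "c"], lambda c: c == "a")` returns two pairs, each with `"a"` as the chosen answer.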

Peer Reviews

Decision·ICLR 2025 Conference Withdrawn Submission

Reviewer 01 · Rating 8 · Confidence 4

Strengths

- CodeLutra is a simple yet effective method, with a clear articulation of how it differs from related work.
- Impressive results: with only 500 samples, CodeLutra achieves GPT-4-level performance on a base model with just 8 billion parameters. For the Spider benchmark, it improves base model performance from 59.3 to 74.4 in just four iterations, surpassing GPT-4’s 74.4. On BIRD, it increases performance from 22.3 to 42.6 in four iterations, approaching GPT-4’s 46.3.
- Comprehensive evaluation, cover

Weaknesses

- The current evaluation focuses on SQL queries and data science problems, which are relatively short (from a few lines to several tens of lines of code). It would be interesting to see how this approach generalizes to longer programs.
- Limited exploration of scenarios without ground truth. In such cases, CodeLutra relies on syntactic error detection, but the results are, as expected, less impressive.
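The syntax-only fallback mentioned in the second point can be illustrated with a small check. This is a sketch under assumptions: the review does not say how the paper detects syntactic errors, so the use of Python's `ast.parse` and the helper names `is_syntactically_valid` and `split_by_syntax` are hypothetical choices for Python candidates only.

```python
import ast
from typing import List, Tuple


def is_syntactically_valid(code: str) -> bool:
    """Return True if the candidate parses as Python source."""
    try:
        ast.parse(code)
    except (SyntaxError, ValueError):
        # ValueError covers inputs such as null bytes on some Python versions.
        return False
    return True


def split_by_syntax(candidates: List[str]) -> Tuple[List[str], List[str]]:
    """Split candidates into (parsable, unparsable) without any ground truth."""
    ok = [c for c in candidates if is_syntactically_valid(c)]
    bad = [c for c in candidates if not is_syntactically_valid(c)]
    return ok, bad
```

A check like this only rejects code that fails to parse. A candidate that parses but computes the wrong answer still lands in the accepted group, which is one plausible reason the signal is weaker than execution against ground truth.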

Reviewer 02 · Rating 5 · Confidence 4

Strengths

The paper is well written. The proposed method, which trains iteratively on both correct and failed generations, makes sense. Experiments show good improvement on benchmarks.

Weaknesses

Parts of the experimental setup are not clear enough, such as the training data, the SFT setting, and the details of the synthetically generated dataset. One of the contributions, the combined DPO and SFT loss, has been studied in previous literature. More experiments may be needed to compare SFT followed by DPO against the DPO+SFT loss.
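For readers unfamiliar with the DPO+SFT loss under discussion, the sketch below computes it for a single preference pair from sequence log-probabilities. The DPO term follows the standard formulation; the additive SFT term on the chosen response, the `sft_weight` factor, the default `beta`, and the function name `dpo_sft_loss` are assumptions for illustration, since this page does not give the paper's exact weighting or normalization.

```python
import math


def dpo_sft_loss(
    policy_chosen_logp: float,    # log pi_theta(y_w | x)
    policy_rejected_logp: float,  # log pi_theta(y_l | x)
    ref_chosen_logp: float,       # log pi_ref(y_w | x)
    ref_rejected_logp: float,     # log pi_ref(y_l | x)
    beta: float = 0.1,            # assumed value
    sft_weight: float = 1.0,      # assumed weighting of the SFT term
) -> float:
    """Combined loss for one (chosen, rejected) pair."""
    # Scaled difference of log-ratios between chosen and rejected responses.
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    # DPO term: -log(sigmoid(margin)), written as a numerically stable softplus.
    dpo = max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
    # SFT term: negative log-likelihood of the chosen response.
    sft = -policy_chosen_logp
    return dpo + sft_weight * sft
```

Setting `sft_weight=0.0` recovers plain DPO in this sketch. Training with SFT first and DPO second would instead optimize the two terms in separate stages rather than as one sum.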

Reviewer 03 · Rating 3 · Confidence 4

Strengths

NA

Weaknesses

1. The proposed method closely resembles that presented in [1]. Applying the same approach to a different scenario does not warrant publication, especially since this new scenario is simpler and benefits from execution feedback.

[1] Iterative Reasoning Preference Optimization. https://arxiv.org/abs/2404.19733

Reviewer 04 · Rating 3 · Confidence 4

Strengths

* Comprehensive evaluation, ablation, and analysis support the effectiveness of the proposed method. In particular, the necessity of negative training samples and of SFT loss are both well studied.
* The paper is well written and easy to follow.

Weaknesses

* The technical novelty of this paper is somewhat limited. L233-246 claimed two major points of novelty: refinement from execution feedback and a dual loss mechanism. First, using feedback from program execution to iteratively refine code LLMs is a direction that has been extensively studied (e.g., CodeRL [1] and NExT [2]). However, these works are not discussed in the related work section. Second, the dual loss objective (i.e., adding SFT loss in DPO training) was proposed in [3], known as RPO,

Reviewer 05 · Rating 6 · Confidence 3

Strengths

1. The paper is well organized and easy to follow.
2. The proposed method yields a fine-tuned Llama-3-8B model with performance comparable to GPT-4.
3. The authors conduct comprehensive ablation studies that clearly demonstrate the effect of every component of their method.
4. The method still performs well with limited annotations or training samples.

Weaknesses

1. Line 230 states that "The refinement process continues until the improvement between consecutive iteration becomes marginal". However, in the experiments, the authors seem to fix the number of iterations at 4. In practice, how do you decide whether the improvement between consecutive iterations is marginal?
2. The baseline setup is not clear enough and may not be comprehensive.
   a) For closed-source LLMs, it is unknown what prompting method is used. It is also not clearly stated what fine-tuning metho

Taxonomy

Topics: Natural Language Processing Techniques · Digital Rights Management and Security