WizardCoder: Empowering Code Large Language Models with Evol-Instruct

Ziyang Luo; Can Xu; Pu Zhao; Qingfeng Sun; Xiubo Geng; Wenxiang Hu; Chongyang Tao; Jing Ma; Qingwei Lin; Daxin Jiang

arXiv:2306.08568·cs.CL·May 28, 2025·82 cites

TL;DR

WizardCoder enhances code large language models with instruction fine-tuning, significantly improving their performance on code generation benchmarks and surpassing both open-source and some closed models.

Contribution

Introduces WizardCoder, a Code LLM fine-tuned with Evol-Instruct adapted to code, which achieves the best results among open-source Code LLMs on HumanEval, HumanEval+, MBPP, and DS-1000.

Findings

01 Outperforms all open-source Code LLMs on key benchmarks.

02 Surpasses large closed LLMs like Claude and Bard on HumanEval.

03 Code, models, and data are publicly available.

Abstract

Code Large Language Models (Code LLMs), such as StarCoder, have demonstrated exceptional performance in code-related tasks. However, most existing models are solely pre-trained on extensive raw code data without instruction fine-tuning. In this paper, we introduce WizardCoder, which empowers Code LLMs with complex instruction fine-tuning, by adapting the Evol-Instruct method to the domain of code. Through comprehensive experiments on four prominent code generation benchmarks, namely HumanEval, HumanEval+, MBPP, and DS-1000, we unveil the exceptional capabilities of our model. It surpasses all other open-source Code LLMs by a substantial margin. Moreover, our model even outperforms the largest closed LLMs, Anthropic's Claude and Google's Bard, on HumanEval and HumanEval+. Our code, model weights, and data are public at https://github.com/nlpxucan/WizardLM
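The abstract's idea of evolving instructions can be illustrated with a minimal sketch. It is not the authors' implementation: the heuristic texts, the prompt wording, the deduplication step, and the names `EVOLUTION_HEURISTICS`, `build_evolution_prompt`, `evolve`, and `fake_rewrite` are all assumptions made for illustration. A real pipeline would replace `fake_rewrite` with a call to an instruction-following LLM and would also generate a solution for each evolved instruction.

```python
import random
from typing import Callable, List, Optional

# Hypothetical heuristics, paraphrased for illustration only; these are
# NOT the prompts used in the paper.
EVOLUTION_HEURISTICS = [
    "Add one new constraint or requirement to the task.",
    "Replace a common requirement with a less common, more specific one.",
    "Require more reasoning steps to solve the task.",
]


def build_evolution_prompt(instruction: str, heuristic: str) -> str:
    """Wrap an instruction in a request to make it harder (assumed wording)."""
    return (
        "Increase the difficulty of the programming task below.\n"
        f"Method: {heuristic}\n\n"
        f"Task: {instruction}\n"
    )


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so near-identical strings compare equal."""
    return " ".join(text.lower().split())


def evolve(
    seeds: List[str],
    rewrite: Callable[[str], str],
    rounds: int,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Run several evolution rounds and return seeds plus all evolved instructions.

    `rewrite` stands in for an LLM call: it takes a prompt and returns text.
    Exact-match deduplication is an assumption of this sketch.
    """
    rng = rng or random.Random()
    pool = list(seeds)
    seen = {normalize(s) for s in seeds}
    frontier = list(seeds)
    for _ in range(rounds):
        next_frontier = []
        for instruction in frontier:
            heuristic = rng.choice(EVOLUTION_HEURISTICS)
            evolved = rewrite(build_evolution_prompt(instruction, heuristic)).strip()
            key = normalize(evolved)
            if not evolved or key in seen:
                continue  # drop empty or duplicate outputs
            seen.add(key)
            pool.append(evolved)
            next_frontier.append(evolved)
        frontier = next_frontier  # each round evolves the previous round's output
    return pool


def fake_rewrite(prompt: str) -> str:
    """Stub standing in for an LLM: echoes the task with one extra requirement."""
    task = prompt.split("Task: ", 1)[1].strip()
    return task + " Also handle empty input."
```

Calling `evolve(["Write a function that reverses a string."], fake_rewrite, rounds=2)` returns the seed followed by one evolved instruction per round. Feeding each round's output into the next is what lets instruction complexity grow across rounds.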

Peer Reviews

Decision·ICLR 2024 poster

Reviewer 01 · Rating 6 (marginally above the acceptance threshold) · Confidence 4

Strengths

- Clever instruction fine-tuning idea: creating datasets synthetically using GPT-3.5 and a small set of seed tasks.
- Exhaustively tested on different programming languages, with impressive performance gains using publicly available models (StarCoder and CodeLlama-34B) across several benchmarks: HumanEval+, MBPP, MultiPL-E, and DS-1000.

Weaknesses

- While results on code benchmarks are impressive, the novelty of the scientific methodology itself is quite limited, as it is an adaptation of Evol-Instruct for code.
- Missing human assessment.
- It is not clear how useful the final fine-tuned model is outside the benchmarks that focus exclusively on functional correctness. The model hasn't been tested on developer productivity tasks like completion and code refinement.
- Not clear if data leakage has been prevented. Does the evolved data or seed data over

Reviewer 02 · Rating 6 (marginally above the acceptance threshold) · Confidence 5

Strengths

- The method works. It produced a top-performing open-source code model that surpasses bigger and closed-source models in multiple open evaluations. This is the biggest strength of the paper and its value to the research community.
- The paper provides extensive comparisons with existing models and between different rounds of Evol-Instruct expansion.

Weaknesses

- Missing some key details. Where does the new coding solution come from after expanding the instruction? Was the base model itself or GPT-3 used to generate it? Were the expanded instructions deduplicated? Was the quality of the new instructions verified by executing the generated code? Update: based on the authors' feedback, the coding solutions in the training data are from GPT-3.5-turbo. This presents a risk of data leakage, that WizardCoder becomes an implicitly distilled model o

Reviewer 03 · Rating 5 (marginally below the acceptance threshold) · Confidence 4

Strengths

+ I think overall exploring ideas around how we can improve the efficacy of LLMs for different applications is interesting.
+ Improving SOTA using the proposed instruction-tuning method is valuable and opens up a new direction. The ablation studies further help to explain how and to what extent each technique helps (with some caveats that I will expand on in the question section).

Weaknesses

- It is not clear how the authors came up with the list of heuristics for data evolution. This lack of clarity makes such approaches less applicable to a wide range of tasks.
- While the ablation studies in the main body provide some insights on the efficacy of the technique (additional clarification in the questions/recommendation section).
- While the idea is interesting, it seems very incremental compared to prior work, and the contributions are limited.

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Software Engineering Research