WizardCoder: Empowering Code Large Language Models with Evol-Instruct
Ziyang Luo, Can Xu, Pu Zhao, Qingfeng Sun, Xiubo Geng, Wenxiang Hu, Chongyang Tao, Jing Ma, Qingwei Lin, Daxin Jiang

TL;DR
WizardCoder enhances code large language models with instruction fine-tuning, significantly improving their performance on code generation benchmarks and surpassing both open-source and some closed models.
Contribution
Introduces WizardCoder, a code LLM fine-tuned with Evol-Instruct, achieving state-of-the-art results on multiple code benchmarks.
Findings
Outperforms all other open-source Code LLMs by a substantial margin on HumanEval, HumanEval+, MBPP, and DS-1000.
Surpasses large closed LLMs such as Claude and Bard on HumanEval and HumanEval+.
Code, models, and data are publicly available.
Abstract
Code Large Language Models (Code LLMs), such as StarCoder, have demonstrated exceptional performance in code-related tasks. However, most existing models are solely pre-trained on extensive raw code data without instruction fine-tuning. In this paper, we introduce WizardCoder, which empowers Code LLMs with complex instruction fine-tuning, by adapting the Evol-Instruct method to the domain of code. Through comprehensive experiments on four prominent code generation benchmarks, namely HumanEval, HumanEval+, MBPP, and DS-1000, we unveil the exceptional capabilities of our model. It surpasses all other open-source Code LLMs by a substantial margin. Moreover, our model even outperforms the largest closed LLMs, Anthropic's Claude and Google's Bard, on HumanEval and HumanEval+. Our code, model weights, and data are public at https://github.com/nlpxucan/WizardLM
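To make the idea of evolving instructions concrete, here is a minimal sketch of an instruction-evolution loop. It is not the authors' implementation: the heuristic strings, the prompt wording, and the names `EVOLUTION_HEURISTICS`, `build_evolution_prompt`, `evolve_instructions`, and `stub_llm` are all hypothetical, and the LLM is replaced by a stub so the example runs offline. In a real pipeline the `llm` callable would wrap an actual model, and a second call would generate a solution for each evolved instruction.

```python
import random
from typing import Callable, List

# Hypothetical heuristics, for illustration only (not taken from the paper).
EVOLUTION_HEURISTICS = [
    "Add one new constraint or requirement to the task.",
    "Require the solution to handle an additional edge case.",
    "Ask for a solution with stricter time or space complexity.",
]


def build_evolution_prompt(instruction: str, heuristic: str) -> str:
    """Wrap a task in a prompt asking an LLM to make it harder."""
    return (
        "Increase the difficulty of the following programming task.\n"
        f"Method: {heuristic}\n\n"
        f"Task: {instruction}"
    )


def evolve_instructions(
    seeds: List[str],
    llm: Callable[[str], str],
    rounds: int,
    rng: random.Random,
) -> List[str]:
    """Return the seeds plus the instructions produced in every round.

    Each round rewrites the previous round's output, so difficulty
    compounds from one round to the next.
    """
    pool = list(seeds)
    current = list(seeds)
    for _ in range(rounds):
        evolved = []
        for instruction in current:
            heuristic = rng.choice(EVOLUTION_HEURISTICS)
            evolved.append(llm(build_evolution_prompt(instruction, heuristic)))
        pool.extend(evolved)
        current = evolved
    return pool


def stub_llm(prompt: str) -> str:
    """Stand-in for a real model: echoes the task with a marker appended."""
    task = prompt.split("Task: ", 1)[1]
    return task + " (harder)"


seeds = ["Write a function that reverses a string.", "Sum a list of integers."]
pool = evolve_instructions(seeds, stub_llm, rounds=3, rng=random.Random(0))
for item in pool:
    print(item)
```

Keeping the output of every round in the pool, rather than only the final round, is one possible design choice; it yields training data that spans a range of difficulty levels.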
Peer Reviews
Decision: ICLR 2024 poster
- Clever instruction fine-tuning idea: creating datasets synthetically using GPT-3.5 and a small set of seed tasks.
- Exhaustively tested on different programming languages, with impressive performance gains using publicly available models (StarCoder and CodeLlama-34B) across several benchmarks: HumanEval+, MBPP, MultiPL-E, and DS-1000.
- While the results on code benchmarks are impressive, the novelty of the scientific methodology itself is quite limited, as it is an adaptation of Evol-Instruct for code.
- Missing human assessment.
- It is not clear how useful the final fine-tuned model is outside benchmarks that focus exclusively on functional correctness. The model hasn't been tested on developer productivity tasks such as completion and code refinement.
- Not clear if data leakage has been prevented. Does the evolved data or seed data over
- The method works. It produced a top-performing open-source code model that surpasses bigger and closed-source models in multiple open evaluations. This is the biggest strength of the paper and its value to the research community.
- The paper provides extensive comparisons with existing models and between different rounds of Evol-Instruct expansion.
- Missing some key details. Where does the new coding solution come from after expanding the instruction? Did we use the base model itself, or GPT-3, to generate them? Did we do any deduplication of the expanded instructions? Did we verify the quality of new instructions by executing the generated code? Update: based on the authors' feedback, the coding solutions in the training data are from GPT-3.5-turbo. This presents a risk of data leakage, that WizardCoder becomes an implicitly distilled model o
+ I think overall exploring ideas around how we can improve the efficacy of LLMs for different applications is interesting.
+ Improving SOTA using the proposed instruction-tuning method is valuable and opens up a new direction. The ablation studies further help to show how and to what extent each technique helps (with some caveats that I will expand on in the question section).
- It is not clear how the authors came up with the list of heuristics for data evolution. This lack of clarity makes such approaches less applicable to a wide range of tasks.
- The ablation studies in the main body provide only some insight into the efficacy of the technique (additional clarification is requested in the questions/recommendation section).
- While the idea is interesting, it seems very incremental compared to prior work, and the contributions are limited.
Code & Models
- 🤗 WizardLMTeam/WizardLM-13B-V1.0 (model) · 184 dl · ♡ 74
- 🤗 WizardLMTeam/WizardCoder-15B-V1.0 (model) · 321 dl · ♡ 763
- 🤗 Apoorvakoira/wizabc (model) · 23 dl · ♡ 1
- 🤗 TheBloke/WizardLM-13B-V1.1-GPTQ (model) · 14 dl · ♡ 27
- 🤗 GenerativeMagic/Llama-Engineer-Evol-7b (model) · 19 dl · ♡ 5
- 🤗 WizardLMTeam/WizardLM-13B-V1.2 (model) · 1.8k dl · ♡ 222
- 🤗 TheBloke/WizardLM-13B-V1.2-GPTQ (model) · 48 dl · ♡ 35
- 🤗 TheBloke/WizardLM-13B-V1.2-GGML (model) · 5 dl · ♡ 55
- 🤗 Mediocreatmybest/WizardCoder-15B-V1.0_8bit (model) · 23 dl
- 🤗 WizardLMTeam/WizardLM-70B-V1.0 (model) · 18k dl · ♡ 235
- WizardLMTeam/WizardLM_evol_instruct_70k (dataset) · 1.3k dl
- WizardLMTeam/WizardLM_evol_instruct_V2_196k (dataset) · 3.5k dl
- nickrosh/Evol-Instruct-Code-80k-v1 (dataset) · 3.2k dl
- CodeResearch/Code-Evol-Instruct-OSS (dataset) · 178 dl
- nlp-with-deeplearning/Ko.WizardLM_evol_instruct_V2_196k (dataset) · 24 dl
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Software Engineering Research
