WizardLM: Empowering large pre-trained language models to follow complex instructions

Can Xu; Qingfeng Sun; Kai Zheng; Xiubo Geng; Pu Zhao; Jiazhan Feng; Chongyang Tao; Qingwei Lin; Daxin Jiang

arXiv:2304.12244 · cs.CL · May 28, 2025 · 108 cites

TL;DR

This paper introduces Evol-Instruct, a method to automatically generate complex instruction data using LLMs, which enhances the fine-tuning of large language models like LLaMA to better follow instructions, rivaling some capabilities of ChatGPT.

Contribution

The paper presents a novel AI-driven data augmentation technique for instruction tuning, reducing manual effort and improving model performance on complex tasks.

Findings

1. Evol-Instruct-generated instructions outperform human-created ones in human evaluations.
2. WizardLM achieves over 90% of ChatGPT's capacity on multiple skills.
3. Fine-tuning with AI-evolved instructions is a promising approach for LLM enhancement.

Abstract

Training large language models (LLMs) with open-domain instruction following data brings colossal success. However, manually creating such instruction data is very time-consuming and labor-intensive. Moreover, humans may struggle to produce high-complexity instructions. In this paper, we show an avenue for creating large amounts of instruction data with varying levels of complexity using LLM instead of humans. Starting with an initial set of instructions, we use our proposed Evol-Instruct to rewrite them step by step into more complex instructions. Then, we mix all generated instruction data to fine-tune LLaMA. We call the resulting model WizardLM. Human evaluations on a complexity-balanced test bed and Vicuna's testset show that instructions from Evol-Instruct are superior to human-created ones. By analyzing the human evaluation results of the high complexity part, we demonstrate that…
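The pipeline the abstract describes (start from seed instructions, rewrite them step by step, then mix every generation into one training set) can be sketched roughly as below. This is a minimal illustration, not the authors' implementation: the prompt templates, the `evolve_pool` function, the unchanged-output filter, and the `fake_llm` stand-in are all assumptions introduced here. The choice between in-depth and in-breadth evolving is made at random, as the reviews on this page describe.

```python
import random
from typing import Callable, List

# Hypothetical prompt templates; the paper's actual prompts are not reproduced here.
IN_DEPTH = "Rewrite the instruction below into a more complex version:\n{instruction}"
IN_BREADTH = "Write a new, rarer instruction in the same domain as the one below:\n{instruction}"


def evolve_pool(
    seed: List[str],
    llm: Callable[[str], str],
    rounds: int,
    rng: random.Random,
) -> List[str]:
    """Return the seed instructions plus every evolved generation, mixed together."""
    pool = list(seed)      # the mixed dataset that would later be used for fine-tuning
    current = list(seed)   # instructions to evolve in the next round
    for _ in range(rounds):
        next_gen = []
        for inst in current:
            # Randomly pick an evolution direction for this instruction.
            template = rng.choice([IN_DEPTH, IN_BREADTH])
            evolved = llm(template.format(instruction=inst)).strip()
            # Assumed filter: discard empty or unchanged rewrites.
            if evolved and evolved != inst:
                next_gen.append(evolved)
        pool.extend(next_gen)
        current = next_gen
    return pool


def fake_llm(prompt: str) -> str:
    """Deterministic stand-in for a real LLM call, used only to make the sketch runnable."""
    instruction = prompt.splitlines()[-1]
    return instruction + " (explain each step)"


seed_instructions = ["Add 1 and 2.", "Name a color."]
demo = evolve_pool(seed_instructions, fake_llm, rounds=2, rng=random.Random(0))
for item in demo:
    print(item)
```

In practice `fake_llm` would be replaced by a call to whichever LLM performs the rewriting, and the resulting pool would still need responses generated for each instruction before it could be used to fine-tune a model.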

Peer Reviews

Decision: ICLR 2024 poster

Reviewer 01 · Rating 6 (marginally above the acceptance threshold) · Confidence 4

Strengths

1. The idea of using LLMs to rewrite and synthesize more complex instructions is interesting and intuitive.
2. The paper is overall well-written and easy to follow.
3. The authors evaluate WizardLM on a wide range of datasets/benchmarks and the experimental results look promising.

Weaknesses

1. The technical contribution of the proposed method is not very significant because, compared to self-instruct, it only adds a command for the LLM to generate more complex instructions. The success of the proposed method largely depends on the abilities of powerful LLMs such as ChatGPT. While interesting and intuitive, I'm not sure the technical contribution of the manuscript is suitable for conferences such as ICLR.
2. The authors compare WizardLM with Vicuna by using the same amount of gener

Reviewer 02 · Rating 6 (marginally above the acceptance threshold) · Confidence 5

Strengths

The paper identifies an important problem: instruction tuning data lacks diversity and difficulty. I agree this is indeed a problem. The reported evaluation results are good and lend support to the claim that the new dataset is better than the baselines.

Weaknesses

Major Issues

The authors did not perform a thorough evaluation of the created dataset.

- The authors do not compare against diverse instruction tuning datasets such as Natural Instructions, Supernatural Instructions, and the training data of FLAN-T5.
- The experiments focus on Llama and ignore other LLMs such as T5, Falcon, Mistral (which has a model not instruction tuned), and so on.
- The evaluation of difficulty and diversity by ChatGPT is unconvincing, as the paper presents no evid

Reviewer 03 · Rating 6 (marginally above the acceptance threshold) · Confidence 3

Strengths

- Innovative Approach: The Evol-Instruct method offers a fresh perspective on generating complex instructions without relying solely on human input.
- Comprehensive Evaluation: The authors provide both automatic and human evaluations to assess the performance of WizardLM.
- Broad Applicability: WizardLM's superior performance in varied benchmarks, including code, math, and general conversation, suggests its broad applicability.
- Consideration of Instruction Complexity: The paper highlights the

Weaknesses

- Reliance on LLMs: The Evol-Instruct method's dependence on LLMs may introduce biases from the original training data.
- Uncertainty in Evolution Direction: The random selection between In-depth and In-breadth evolving may not always yield the optimal instruction complexity.
- Potential Overfitting: The process of evolving instructions multiple times might risk overfitting the model to specific types of instructions.
- Paper Organization: This paper is not well-organized enough, especially for

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Machine Learning and Data Classification

Methods: Multi-Head Attention · Attention Is All You Need · Dropout · Label Smoothing · Dense Connections · Linear Layer · Position-Wise Feed-Forward Layer · Layer Normalization · Absolute Position Encodings · Residual Connection