WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct
Haipeng Luo, Qingfeng Sun, Can Xu, Pu Zhao, Jianguang Lou, Chongyang Tao, Xiubo Geng, Qingwei Lin, Shifeng Chen, Yansong Tang, Dongmei Zhang

TL;DR
WizardMath enhances large language models' mathematical reasoning by using reinforcement learning with evolved instructions, significantly outperforming existing open-source models and even some proprietary ones on key benchmarks.
Contribution
Introduces WizardMath, a model trained with Reinforcement Learning from Evol-Instruct Feedback (RLEIF), which improves LLMs' math reasoning without external tools by combining instruction evolution with process supervision.
Findings
WizardMath-Mistral 7B surpasses top open-source LLMs in math reasoning.
WizardMath 70B outperforms GPT-3.5-Turbo, Claude 2, Gemini Pro, and GPT-4-early-version.
Instruction evolution and process supervision are crucial for high math performance.
Abstract
Large language models (LLMs), such as GPT-4, have shown remarkable performance in natural language processing (NLP) tasks, including challenging mathematical reasoning. However, most existing open-source models are only pre-trained on large-scale internet data, without math-related optimization. In this paper, we present WizardMath, which enhances the mathematical chain-of-thought (CoT) reasoning abilities of LLMs without using external Python tools, by applying our proposed Reinforcement Learning from Evol-Instruct Feedback (RLEIF) method to the domain of math. Through extensive experiments on two mathematical reasoning benchmarks, namely GSM8k and MATH, we reveal the extraordinary capabilities of our model. Remarkably, WizardMath-Mistral 7B surpasses top-tier open-source LLMs by a substantial margin with higher data efficiency. Furthermore, WizardMath 70B even outperforms GPT-3.5-Turbo, Claude 2,…
Peer Reviews
Decision: ICLR 2025 Oral
**1. Strong Results** -- I put a lot of premium on this strength and use this to justify my overall rating. Many of the gains from training on Math Evol-Instruct are more than 10 points. More importantly, it is quite impressive to design something that outperforms strong proprietary models, so if this method is as strong as the paper claims, then this is something that the community will definitely quickly pick up on. **2. Thorough Experiments and Baseline Comparisons** -- Various scales rangin
**1. PRM labels from GPT-4** -- Not really sure what to think of this. On one hand, I feel such direct distillation like this would limit the effectiveness of a method at larger data scales. On the other hand, the results seem to be good (and also this is one key part that makes the process fully AI-automated.) **2. Unclear presentation** -- The paper assumes that readers are already previously familiar with Evol-Instruct, as it devotes very little time to talking about it in the intro or relat
This paper is well-written and rich in detail. The introduction of the Instruction Reward Model (IRM) is novel and useful. The experiments are comprehensive, and the analysis is thorough, providing deep insights into the framework's effectiveness. I believe the idea of integrating IRM with PRM could be useful for any math LLMs (only undergoing SFT).
The primary concern with this paper is the unfair comparison of baseline models in the results. While the authors claim that both supervised fine-tuning (SFT) with Math Evol-Instruct and reinforcement learning (RL) with the Instruction Reward Model (IRM) and Process Reward Model (PRM) are beneficial for enhancing mathematical reasoning, these approaches—SFT with synthesized data and the use of various reward models for RL—represent parallel research lines. - In Table 1, the authors compare thei
The paper is well-written, with comprehensive analysis and ablation experiments. It represents a valuable exploration of process supervision and IRM in math reasoning, achieving impressive performance. At the same time, their method is data efficient.
NA
Code & Models
- 🤗 WizardLMTeam/WizardLM-13B-V1.0 · model · 184 dl · ♡ 74
- 🤗 WizardLMTeam/WizardCoder-15B-V1.0 · model · 321 dl · ♡ 763
- 🤗 TheBloke/WizardLM-13B-V1.1-GPTQ · model · 14 dl · ♡ 27
- 🤗 WizardLMTeam/WizardLM-13B-V1.2 · model · 1.8k dl · ♡ 222
- 🤗 TheBloke/WizardLM-13B-V1.2-GPTQ · model · 48 dl · ♡ 35
- 🤗 TheBloke/WizardLM-13B-V1.2-GGML · model · 5 dl · ♡ 55
- 🤗 WizardLMTeam/WizardLM-70B-V1.0 · model · 18k dl · ♡ 235
- 🤗 TheBloke/WizardLM-70B-V1.0-GGML · model · 5 dl · ♡ 32
- 🤗 TheBloke/WizardLM-70B-V1.0-GPTQ · model · 2.6k dl · ♡ 37
- 🤗 WizardLMTeam/WizardMath-7B-V1.0 · model · 127 dl · ♡ 54
- WizardLMTeam/WizardLM_evol_instruct_70k · dataset · 1.3k dl
- WizardLMTeam/WizardLM_evol_instruct_V2_196k · dataset · 3.5k dl
- nlp-with-deeplearning/Ko.WizardLM_evol_instruct_V2_196k · dataset · 24 dl
- hkust-nlp/dart-math-pool-math · dataset · 310 dl
- hkust-nlp/dart-math-pool-gsm8k · dataset · 56 dl
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Machine Learning and Data Classification
Methods: Multi-Head Attention · Attention Is All You Need · Cosine Annealing · Position-Wise Feed-Forward Layer · Label Smoothing · Linear Layer · Layer Normalization · Absolute Position Encodings
