WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct

Haipeng Luo; Qingfeng Sun; Can Xu; Pu Zhao; Jianguang Lou; Chongyang Tao; Xiubo Geng; Qingwei Lin; Shifeng Chen; Yansong Tang; Dongmei Zhang

arXiv:2308.09583·cs.CL·June 5, 2025·27 cites

TL;DR

WizardMath enhances large language models' mathematical reasoning by using reinforcement learning with evolved instructions, outperforming existing open-source models and even some proprietary ones on the GSM8k and MATH benchmarks.

Contribution

Introduces WizardMath, a family of models trained with Reinforcement Learning from Evol-Instruct Feedback (RLEIF), which improves LLMs' math reasoning without external tools by combining instruction evolution with process supervision.

Findings

01

WizardMath-Mistral 7B surpasses top open-source LLMs in math reasoning.

02

WizardMath 70B outperforms GPT-3.5-Turbo, Claude 2, Gemini Pro, and GPT-4-early-version.

03

Instruction evolution and process supervision are crucial for high math performance.

Abstract

Large language models (LLMs), such as GPT-4, have shown remarkable performance in natural language processing (NLP) tasks, including challenging mathematical reasoning. However, most existing open-source models are only pre-trained on large-scale internet data, without math-related optimization. In this paper, we present WizardMath, which enhances the mathematical CoT reasoning abilities of LLMs without using external Python tools, by applying our proposed Reinforcement Learning from Evol-Instruct Feedback (RLEIF) method to the domain of math. Through extensive experiments on two mathematical reasoning benchmarks, namely GSM8k and MATH, we reveal the extraordinary capabilities of our model. Remarkably, WizardMath-Mistral 7B surpasses top-tier open-source LLMs by a substantial margin with higher data efficiency. Furthermore, WizardMath 70B even outperforms GPT-3.5-Turbo, Claude 2,…
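The abstract names RLEIF but this page does not spell out how its reward signals are combined. As a rough illustration only, the sketch below assumes two signals of the kind the reviews mention: an instruction-quality score (from an Instruction Reward Model) and one score per reasoning step (from a Process Reward Model). The names `RewardInputs` and `combined_reward`, the [0, 1] score range, the minimum-over-steps aggregation, and the product combination are all assumptions made for this example, not details confirmed by the text here.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class RewardInputs:
    """Hypothetical container for the two reward signals."""
    instruction_score: float      # assumed IRM output in [0, 1]
    step_scores: Sequence[float]  # assumed PRM outputs in [0, 1], one per step


def combined_reward(inputs: RewardInputs) -> float:
    """Collapse instruction and process scores into one scalar reward.

    Assumption: the step scores are aggregated by their minimum, so a
    single bad reasoning step lowers the whole solution's score, and the
    result is multiplied by the instruction score.
    """
    if not inputs.step_scores:
        raise ValueError("at least one step score is required")
    for score in (inputs.instruction_score, *inputs.step_scores):
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score out of range [0, 1]: {score}")
    process_score = min(inputs.step_scores)
    return inputs.instruction_score * process_score


if __name__ == "__main__":
    example = RewardInputs(instruction_score=0.5, step_scores=[1.0, 0.5])
    print(combined_reward(example))
```

A product makes the reward small whenever either signal is small, which is one plausible way to keep a policy from exploiting a weak instruction or a flawed step; other aggregations (mean, product over steps) would be equally easy to swap in.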

Peer Reviews

Decision·ICLR 2025 Oral

Reviewer 01 · Rating 8 · Confidence 3

Strengths

**1. Strong Results** -- I put a lot of premium on this strength and use this to justify my overall rating. Many of the gains from training on Math Evol-Instruct are more than 10 points. More importantly, it is quite impressive to design something that outperforms strong proprietary models, so if this method is as strong as the paper claims, then this is something that the community will definitely quickly pick up on. **2. Thorough Experiments and Baseline Comparisons** -- Various scales rangin

Weaknesses

**1. PRM labels from GPT-4** -- Not really sure what to think of this. On one hand, I feel such direct distillation like this would limit the effectiveness of a method at larger data scales. On the other hand, the results seem to be good (and also this is one key part that makes the process fully AI-automated.) **2. Unclear presentation** -- The paper assumes that readers are already previously familiar with Evol-Instruct, as it devotes very little time to talking about it in the intro or relat

Reviewer 02 · Rating 8 · Confidence 3

Strengths

This paper is well-written and rich in detail. The introduction of the Instruction Reward Model (IRM) is novel and useful. The experiments are comprehensive, and the analysis is thorough, providing deep insights into the framework's effectiveness. I believe the idea of integrating IRM with PRM could be useful for any math LLMs (only undergoing SFT).

Weaknesses

The primary concern with this paper is the unfair comparison of baseline models in the results. While the authors claim that both supervised fine-tuning (SFT) with Math Evol-Instruct and reinforcement learning (RL) with the Instruction Reward Model (IRM) and Process Reward Model (PRM) are beneficial for enhancing mathematical reasoning, these approaches—SFT with synthesized data and the use of various reward models for RL—represent parallel research lines. - In Table 1, the authors compare thei

Reviewer 03 · Rating 8 · Confidence 3

Strengths

The paper is well-written, with comprehensive analysis and ablation experiments. It represents a valuable exploration of process supervision and IRM in math reasoning, achieving impressive performance. At the same time, their method is data efficient.

Weaknesses

NA

Code & Models

Repositories

nlpxucan/wizardlm
pytorch

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Machine Learning and Data Classification

Methods: Multi-Head Attention · Attention Is All You Need · Cosine Annealing · Position-Wise Feed-Forward Layer · Label Smoothing · Linear Layer · Layer Normalization · Absolute Position Encodings