RL on Incorrect Synthetic Data Scales the Efficiency of LLM Math Reasoning by Eight-Fold

Amrith Setlur; Saurabh Garg; Xinyang Geng; Naman Garg; Virginia Smith; Aviral Kumar

arXiv:2406.14532 · cs.LG · June 21, 2024 · 1 citation

TL;DR

This paper shows that training on negative (incorrect) responses in synthetic data makes fine-tuning large language models for math reasoning more efficient and more robust, yielding an eightfold gain in efficiency.

Contribution

It introduces a novel per-step negative response scheme that enhances synthetic data training, unlearns spurious correlations, and is equivalent to advantage-weighted reinforcement learning.
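The link between per-step negatives and advantage-weighted reinforcement learning can be illustrated with a minimal sketch. The summary above states only that the two are equivalent. The specific advantage estimate used here (the change in estimated success rate between consecutive solution prefixes) and the function names `per_step_advantages` and `advantage_weighted_loss` are assumptions made for illustration, not the paper's confirmed formulation.

```python
from typing import List


def per_step_advantages(prefix_values: List[float]) -> List[float]:
    """Assumed advantage estimate for each step of one solution.

    prefix_values[0] is the estimated chance of reaching a correct final
    answer from the problem alone; prefix_values[i] is the same estimate
    after the first i steps. The advantage of step i is the change it causes.
    """
    return [after - before for before, after in zip(prefix_values, prefix_values[1:])]


def advantage_weighted_loss(step_log_probs: List[float], advantages: List[float]) -> float:
    """Negative advantage-weighted log-likelihood over the steps of one solution.

    Minimizing this raises the probability of steps with positive advantage
    and lowers the probability of steps with negative advantage.
    """
    if len(step_log_probs) != len(advantages):
        raise ValueError("need exactly one advantage per step")
    return -sum(a * lp for a, lp in zip(advantages, step_log_probs))


# Hypothetical numbers: the first step helps, the second and third hurt.
values = [0.5, 0.75, 0.25, 0.0]
advantages = per_step_advantages(values)
loss = advantage_weighted_loss([-1.0, -2.0, -0.5], advantages)
```

Under this assumed estimate, a step that lowers the chance of a correct answer receives a negative weight, which is one way an incorrect response can still provide a per-step training signal.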

Findings

01 Fine-tuning on additional correct solutions sampled from the learner itself doubles the efficiency of the same synthetic problems.

02 Negative responses improve robustness and reduce spurious correlations.

03 Per-step negatives achieve an 8x gain in efficiency.

Abstract

Training on model-generated synthetic data is a promising approach for finetuning LLMs, but it remains unclear when it helps or hurts. In this paper, we investigate this question for math reasoning via an empirical study, followed by building a conceptual understanding of our observations. First, we find that while the typical approach of finetuning a model on synthetic correct or positive problem-solution pairs generated by capable models offers modest performance gains, sampling more correct solutions from the finetuned learner itself followed by subsequent fine-tuning on this self-generated data doubles the efficiency of the same synthetic problems. At the same time, training on model-generated positives can amplify various spurious correlations, resulting in flat or even inverse scaling trends as the amount of data increases. Surprisingly, we find that several of these…
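The self-generated positive data described in the abstract can be pictured as a sample-and-filter loop. The sketch below is a minimal illustration under stated assumptions: `collect_self_generated_positives`, the `sample_fn` callback, and the exact-match check on final answers are all hypothetical and do not come from the paper or its repository.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical sampler type: given a question and a sample count k, return
# up to k (solution_text, final_answer) pairs drawn from the learner.
Sampler = Callable[[str, int], List[Tuple[str, str]]]


def collect_self_generated_positives(
    problems: List[Dict[str, str]],
    sample_fn: Sampler,
    k: int,
) -> List[Dict[str, str]]:
    """Keep only sampled solutions whose final answer matches the reference.

    Each problem is a dict with "question" and "answer" keys. Duplicate
    solutions for the same question are dropped so repeated samples do not
    dominate the fine-tuning set.
    """
    positives: List[Dict[str, str]] = []
    for problem in problems:
        seen = set()
        for solution, final_answer in sample_fn(problem["question"], k):
            is_correct = final_answer.strip() == problem["answer"].strip()
            if is_correct and solution not in seen:
                seen.add(solution)
                positives.append({"question": problem["question"], "solution": solution})
    return positives
```

The returned pairs would then serve as additional fine-tuning data for the same problems. Checking only the final answer cannot tell whether the intermediate steps are sound, which may be one route by which the spurious correlations mentioned in the abstract enter the data.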

Code & Models

Repositories

ars22/scaling-LLM-math-synthetic-data
none · Official

Videos

RL on Incorrect Synthetic Data Scales the Efficiency of LLM Math Reasoning by Eight-Fold · SlidesLive

Taxonomy

Topics: Educational Technology and Assessment