MAmmoTH2: Scaling Instructions from the Web

Xiang Yue; Tuney Zheng; Ge Zhang; Wenhu Chen

arXiv:2405.03548 · cs.CL · May 24, 2024 · 2 cites

TL;DR

This paper introduces a paradigm for efficiently harvesting 10 million instruction-response pairs from the pre-training web corpus, and uses them to build the MAmmoTH2 models, improving large language models' reasoning abilities without relying on costly human annotation or GPT-4 distillation.

Contribution

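It presents a paradigm for large-scale instruction data collection from the web in three steps: recalling relevant documents, extracting instruction-response pairs, and refining the extracted pairs with open-source LLMs. Base LLMs fine-tuned on the resulting dataset perform markedly better on reasoning benchmarks.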

Findings

01. MAmmoTH2-7B (Mistral) improves from 11% to 36.7% on MATH and from 36% to 68.4% on GSM8K, without training on any in-domain data.

02. MAmmoTH2-Plus achieves state-of-the-art results on reasoning benchmarks.

03. The approach reduces reliance on costly human-annotated or GPT-4-distilled data for instruction tuning.

Abstract

Instruction tuning improves the reasoning abilities of large language models (LLMs), with data quality and scalability being the crucial factors. Most instruction tuning data come from human crowd-sourcing or GPT-4 distillation. We propose a paradigm to efficiently harvest 10 million naturally existing instruction data from the pre-training web corpus to enhance LLM reasoning. Our approach involves (1) recalling relevant documents, (2) extracting instruction-response pairs, and (3) refining the extracted pairs using open-source LLMs. Fine-tuning base LLMs on this dataset, we build MAmmoTH2 models, which significantly boost performance on reasoning benchmarks. Notably, MAmmoTH2-7B's (Mistral) performance increases from 11% to 36.7% on MATH and from 36% to 68.4% on GSM8K without training on any in-domain data. Further training MAmmoTH2 on public instruction tuning datasets yields…
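The three steps named in the abstract (recall, extract, refine) can be pictured with a toy pipeline. The sketch below is only an illustration under loud assumptions: the keyword list, the "Question: ... Answer: ..." page layout, and every function name are hypothetical, and the `refine` step merely normalizes whitespace where the paper uses open-source LLMs. It is not the authors' implementation.

```python
import re
from dataclasses import dataclass

# Hypothetical keyword list; a stand-in for whatever relevance model does the recall.
KEYWORDS = ("question", "answer", "solve", "prove", "calculate")


@dataclass
class Pair:
    instruction: str
    response: str


def recall(docs, min_hits=2):
    """Step 1: keep documents that look instruction-like (toy keyword count)."""
    kept = []
    for doc in docs:
        text = doc.lower()
        hits = sum(1 for kw in KEYWORDS if kw in text)
        if hits >= min_hits:
            kept.append(doc)
    return kept


# Assumed "Question: ... Answer: ..." layout; real web pages are far messier.
QA_PATTERN = re.compile(
    r"Question:\s*(?P<q>.+?)\s*Answer:\s*(?P<a>.+?)(?=\s*Question:|\Z)",
    re.DOTALL,
)


def extract(doc):
    """Step 2: pull (instruction, response) pairs out of one document."""
    return [Pair(m.group("q"), m.group("a")) for m in QA_PATTERN.finditer(doc)]


def refine(pair):
    """Step 3: placeholder for an LLM rewrite; here it only collapses whitespace."""
    def clean(s):
        return " ".join(s.split())
    return Pair(clean(pair.instruction), clean(pair.response))


def harvest(docs):
    """Run recall -> extract -> refine and drop pairs with an empty side."""
    pairs = []
    for doc in recall(docs):
        for pair in extract(doc):
            refined = refine(pair)
            if refined.instruction and refined.response:
                pairs.append(refined)
    return pairs


if __name__ == "__main__":
    docs = [
        "Question: What is 2 + 2?\nAnswer:   4\n\nQuestion: Solve x + 1 = 3.\nAnswer: x = 2\n",
        "Buy cheap shoes online today.",
    ]
    for p in harvest(docs):
        print(p)
```

Keeping each step as a separate function means the toy recall filter or the whitespace-only refiner could be swapped for a learned classifier or an LLM call without touching the rest of the pipeline.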

Taxonomy

Topics: Video Analysis and Summarization

Methods: Attention Is All You Need · Dropout · Label Smoothing · Residual Connection · Softmax · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Linear Layer · Byte Pair Encoding · Adam