MAmmoTH2: Scaling Instructions from the Web
Xiang Yue, Tuney Zheng, Ge Zhang, Wenhu Chen

TL;DR
This paper introduces MAmmoTH2, a family of models fine-tuned on 10 million instruction-response pairs harvested from the pre-training web corpus, improving large language models' reasoning abilities without relying on costly human annotation or GPT-4 distillation.
Contribution
It presents a paradigm for collecting instruction data at scale from web sources in three steps: recalling relevant documents, extracting instruction-response pairs, and refining the pairs with open-source LLMs.
Findings
MAmmoTH2-7B (Mistral) improves accuracy from 11% to 36.7% on MATH and from 36% to 68.4% on GSM8K, without training on any in-domain data.
MAmmoTH2-Plus achieves state-of-the-art results on reasoning benchmarks.
The approach reduces reliance on costly human or GPT-4 data for instruction tuning.
Abstract
Instruction tuning improves the reasoning abilities of large language models (LLMs), with data quality and scalability being the crucial factors. Most instruction tuning data come from human crowd-sourcing or GPT-4 distillation. We propose a paradigm to efficiently harvest 10 million naturally existing instruction data from the pre-training web corpus to enhance LLM reasoning. Our approach involves (1) recalling relevant documents, (2) extracting instruction-response pairs, and (3) refining the extracted pairs using open-source LLMs. Fine-tuning base LLMs on this dataset, we build MAmmoTH2 models, which significantly boost performance on reasoning benchmarks. Notably, MAmmoTH2-7B's (Mistral) performance increases from 11% to 36.7% on MATH and from 36% to 68.4% on GSM8K without training on any in-domain data. Further training MAmmoTH2 on public instruction tuning datasets yields…
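The three steps named in the abstract (recall, extract, refine) can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify how documents are recalled or which models do the extraction and refinement, so the keyword filter, the `Question:`/`Answer:` regex, and the whitespace-normalizing `refine` hook below are all hypothetical stand-ins chosen to keep the example self-contained.

```python
import re
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class QAPair:
    instruction: str
    response: str


# Hypothetical keyword list: a stand-in for a learned relevance model.
KEYWORDS = ("question", "answer", "solve", "prove", "calculate")


def recall(documents: Iterable[str], min_hits: int = 2) -> List[str]:
    """Step 1 (recall): keep documents that look like they contain Q&A content."""
    kept = []
    for doc in documents:
        text = doc.lower()
        hits = sum(1 for kw in KEYWORDS if kw in text)
        if hits >= min_hits:
            kept.append(doc)
    return kept


# Assumed surface format; the paper's pipeline uses LLMs for this step,
# a regex is used here only so the sketch runs without a model.
QA_PATTERN = re.compile(
    r"Question:\s*(?P<q>.+?)\s*Answer:\s*(?P<a>.+?)(?=Question:|\Z)",
    re.DOTALL,
)


def extract(doc: str) -> List[QAPair]:
    """Step 2 (extract): pull instruction-response pairs out of one document."""
    return [
        QAPair(m.group("q").strip(), m.group("a").strip())
        for m in QA_PATTERN.finditer(doc)
    ]


def normalize(text: str) -> str:
    """Collapse runs of whitespace; a placeholder for an LLM rewrite."""
    return " ".join(text.split())


def refine(pairs: Iterable[QAPair],
           rewrite: Callable[[str], str] = normalize) -> List[QAPair]:
    """Step 3 (refine): clean each pair and drop pairs that end up empty."""
    refined = []
    for pair in pairs:
        q, a = rewrite(pair.instruction), rewrite(pair.response)
        if q and a:
            refined.append(QAPair(q, a))
    return refined


def build_dataset(documents: Iterable[str]) -> List[QAPair]:
    """Chain the three steps over a collection of raw documents."""
    dataset = []
    for doc in recall(documents):
        dataset.extend(refine(extract(doc)))
    return dataset


docs = [
    "Question: What is 2 + 3?\nAnswer: 2 + 3 = 5.\n"
    "Question: Solve x + 1 = 4.\nAnswer:  x = 3.",
    "Welcome to our store. Free shipping on all orders.",
]
dataset = build_dataset(docs)
for pair in dataset:
    print(pair.instruction, "->", pair.response)
```

In a realistic setting, the `rewrite` argument of `refine` is where a call to an open-source LLM would be plugged in, and `recall` would be replaced by a classifier or retrieval model run over the web corpus.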
Code & Models
- 🤗 TIGER-Lab/MAmmoTH2-8B-Plus (model): 12k downloads, 22 likes
- 🤗 TIGER-Lab/MAmmoTH2-7B-Plus (model): 12k downloads, 7 likes
- 🤗 TIGER-Lab/MAmmoTH2-8B (model): 20 downloads, 2 likes
- 🤗 TIGER-Lab/MAmmoTH2-7B (model): 17 downloads
- 🤗 TIGER-Lab/MAmmoTH2-8x7B (model): 13 downloads
- 🤗 TIGER-Lab/MAmmoTH2-8x7B-Plus (model): 8.1k downloads, 14 likes
- 🤗 RichardErkhov/TIGER-Lab_-_MAmmoTH2-7B-4bits (model): 1 download
- 🤗 RichardErkhov/TIGER-Lab_-_MAmmoTH2-7B-8bits (model): 2 downloads
- 🤗 Zoyd/TIGER-Lab_MAmmoTH2-8x7B-Plus-3_0bpw_exl2 (model): 1 download
- 🤗 Zoyd/TIGER-Lab_MAmmoTH2-8x7B-Plus-3_5bpw_exl2 (model)
Taxonomy
Topics: Video Analysis and Summarization
Methods: Attention Is All You Need · Dropout · Label Smoothing · Residual Connection · Softmax · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Linear Layer · Byte Pair Encoding · Adam
