Training and Evaluating Language Models with Template-based Data Generation
Yifan Zhang

TL;DR
This paper introduces Template-based Data Generation (TDG), a scalable method using GPT-4 to create vast, high-quality math problem datasets for training language models with improved reasoning abilities.
Contribution
The paper proposes a paradigm that automates high-quality data generation through GPT-4-generated, parameterized meta-templates, significantly expanding the training data available for reasoning tasks.
Findings
Created over 7 million synthetic math problems with verifiable solutions.
Demonstrated improved reasoning skills in language models trained on the dataset.
Provided a scalable approach to address data scarcity in complex task training.
Abstract
The rapid advancement of large language models (LLMs) such as GPT-3, PaLM, and Llama has significantly transformed natural language processing, showcasing remarkable capabilities in understanding and generating language. However, a fundamental bottleneck persists: these models often struggle with tasks requiring complex, multi-step reasoning, particularly in mathematical problem-solving. This deficiency stems from the critical scarcity of large-scale, high-quality, domain-specific datasets necessary for cultivating sophisticated reasoning abilities. To overcome this challenge, we introduce Template-based Data Generation (TDG), a novel and scalable paradigm that harnesses frontier LLMs (GPT-4) to automatically generate parameterized meta-templates, which in turn synthesize a virtually infinite stream of high-quality problems and solutions. Using this paradigm, we create TemplateMath Part…
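The template-to-problem step described in the abstract can be illustrated with a minimal Python sketch. This is not the paper's implementation: in TDG the meta-templates are generated by GPT-4, whereas the single template below (`unit_price_template`) is hand-written for illustration, and the names `Problem`, `verify`, and `generate` are hypothetical. The sketch only shows the general idea that one parameterized template can yield many distinct problems, each paired with a solution that can be checked programmatically.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class Problem:
    question: str
    solution: str
    answer: int


def unit_price_template(rng: random.Random) -> Problem:
    """Hypothetical hand-written template: total cost of identical items."""
    name = rng.choice(["Ava", "Ben", "Chen"])
    item = rng.choice(["notebooks", "pencils", "stickers"])
    count = rng.randint(2, 12)   # sampled parameter: number of items
    price = rng.randint(1, 20)   # sampled parameter: price per item
    answer = count * price
    question = (
        f"{name} buys {count} {item} at ${price} each. "
        f"How much does {name} spend in total?"
    )
    solution = f"{count} * {price} = {answer}"
    return Problem(question, solution, answer)


def verify(problem: Problem) -> bool:
    """Recompute the product from the solution string and compare it
    with both the stated result and the stored answer."""
    lhs, rhs = problem.solution.split(" = ")
    a, b = (int(x) for x in lhs.split(" * "))
    return a * b == int(rhs) == problem.answer


def generate(n: int, seed: int = 0) -> List[Problem]:
    """Sample n problems from the template, keeping only verified ones."""
    rng = random.Random(seed)
    problems: List[Problem] = []
    while len(problems) < n:
        p = unit_price_template(rng)
        if verify(p):  # discard any problem whose solution fails the check
            problems.append(p)
    return problems


if __name__ == "__main__":
    for p in generate(3, seed=42):
        print(p.question, "->", p.solution)
```

Because the parameters are sampled, re-running `generate` with a different seed yields a different set of problems from the same template, which is the sense in which a small set of templates can be expanded into a very large dataset.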
