Training and Evaluating Language Models with Template-based Data Generation

Yifan Zhang

arXiv:2411.18104·cs.CL·May 15, 2026

TL;DR

This paper introduces Template-based Data Generation (TDG), a scalable method that uses GPT-4 to write parameterized meta-templates, which are then instantiated to produce large, high-quality math problem datasets for training language models to reason better.

Contribution

The authors propose a paradigm that automates high-quality data generation through meta-templates, expanding the training data available for reasoning tasks.

Findings

1. Created over 7 million synthetic math problems with verifiable solutions.
2. Demonstrated improved reasoning skills in language models trained on the dataset.
3. Provided a scalable approach to address data scarcity in complex task training.

Abstract

The rapid advancement of large language models (LLMs) such as GPT-3, PaLM, and Llama has significantly transformed natural language processing, showcasing remarkable capabilities in understanding and generating language. However, a fundamental bottleneck persists: these models often struggle with tasks requiring complex, multi-step reasoning, particularly in mathematical problem-solving. This deficiency stems from the critical scarcity of large-scale, high-quality, domain-specific datasets necessary for cultivating sophisticated reasoning abilities. To overcome this challenge, we introduce Template-based Data Generation (TDG), a novel and scalable paradigm that harnesses frontier LLMs (GPT-4) to automatically generate parameterized meta-templates, which in turn synthesize a virtually infinite stream of high-quality problems and solutions. Using this paradigm, we create TemplateMath Part…
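
To make the idea of a parameterized template concrete, here is a minimal Python sketch; it is not the authors' implementation. The `Problem` record, the `apples_template` function, and the `verify` check are hypothetical names invented for illustration, and the template is handwritten, whereas in TDG the templates are generated by GPT-4. The sketch only shows how one template with random parameters can yield many problems whose solutions can be checked programmatically.

```python
import random
from dataclasses import dataclass


@dataclass
class Problem:
    question: str
    solution: str
    answer: int


def apples_template(rng: random.Random) -> Problem:
    """Hypothetical template: one parameterized word problem plus its solution."""
    name = rng.choice(["Ava", "Ben", "Chen"])
    start = rng.randint(10, 50)
    sold = rng.randint(1, start)  # constraint: cannot sell more than are owned
    price = rng.randint(2, 9)
    answer = sold * price
    question = (
        f"{name} has {start} apples and sells {sold} of them for "
        f"${price} each. How much money does {name} earn?"
    )
    solution = f"{sold} * {price} = {answer}"
    return Problem(question, solution, answer)


def verify(problem: Problem) -> bool:
    """Recompute the arithmetic in the solution string and compare to the answer."""
    lhs, rhs = problem.solution.split("=")
    a, b = (int(x) for x in lhs.split("*"))
    return a * b == int(rhs) == problem.answer


def generate(n: int, seed: int = 0) -> list[Problem]:
    """Instantiate the template n times, keeping only problems that verify."""
    rng = random.Random(seed)
    return [p for p in (apples_template(rng) for _ in range(n)) if verify(p)]
```

Calling `generate(5)` returns five distinct instantiations of the same template. Seeding the generator makes the output reproducible, and the separate `verify` step stands in for the kind of solution checking that the findings describe as "verifiable solutions".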

Code & Models

Repositories

iiis-ai/TemplateMath (GitHub)

Datasets

math-ai/TemplateGSM (dataset, 1.2k downloads)
