Rethinking Data Synthesis: A Teacher Model Training Recipe with Interpretation
Yifang Chen, David Zhu, Simon Du, Kevin Jamieson, Yang Liu

TL;DR
This paper introduces NOMAD, a training paradigm for models built specifically for data synthesis. It addresses a limitation of standard models, which are optimized for question answering rather than data generation, and reports gains over baselines when the synthesized data is used for training.
Contribution
The paper proposes NOMAD, a training approach for data generation models that differs from classical language model training. Its two key techniques are no-prompt-masked training and proper selection of the training set size.
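To make the term concrete, here is a minimal sketch of how prompt-masked and no-prompt-masked label construction might differ. It assumes that "no-prompt-masked" means the training loss covers the prompt tokens as well as the response tokens; the summary does not spell this out. The helper names `build_labels` and `count_supervised` and the toy token ids are hypothetical and not taken from the paper.

```python
# Assumption: "no-prompt-masked" training keeps prompt tokens in the loss,
# whereas standard instruction finetuning masks them out.
IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss


def build_labels(prompt_ids, response_ids, mask_prompt):
    """Return (input_ids, labels) for one prompt/response training example."""
    input_ids = list(prompt_ids) + list(response_ids)
    if mask_prompt:
        # Prompt-masked: prompt positions are ignored by the loss.
        labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    else:
        # No prompt masking: every token is a supervised target.
        labels = list(input_ids)
    return input_ids, labels


def count_supervised(labels):
    """Count the positions that contribute to the loss."""
    return sum(1 for token in labels if token != IGNORE_INDEX)


prompt = [11, 12, 13]   # toy token ids standing in for an instruction
response = [21, 22]     # toy token ids standing in for the answer

_, masked = build_labels(prompt, response, mask_prompt=True)
_, unmasked = build_labels(prompt, response, mask_prompt=False)
```

Under this reading, the unmasked variant also trains the model to produce instruction-like text, which is the behavior a data generator needs when it has to write new prompts rather than only answer them.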
Findings
NOMAD achieves over 4% improvement on TriviaQA
NOMAD achieves over 2% improvement on GSM8K
Interprets synthetic data through relevance and novelty lenses
Abstract
Recent advances in large language model (LLM) training have highlighted the need for diverse, high-quality instruction data. Many recent works explore synthetic data generation using LLMs. However, they primarily focus on prompt engineering with standard supervised instruction-finetuned models, which has a fundamental limitation: these models are optimized for general question-answering/problem-solving rather than data generation. We propose a paradigm shift named NOMAD by investigating how to specifically train models for data generation, demonstrating that this task differs significantly from training a classical LM. We identify two key factors: no-prompt-masked training and proper training set size selection. Our method, NOMAD, shows substantial improvements over baselines, achieving >4% gains in TriviaQA and >2% in GSM8K with limited training data. Finally,…
Taxonomy
Topics: Educational Assessment and Improvement
Methods: Sparse Evolutionary Training · Focus
