Rethinking Data Synthesis: A Teacher Model Training Recipe with Interpretation

Yifang Chen; David Zhu; Simon Du; Kevin Jamieson; Yang Liu

arXiv:2410.20362 · cs.CL · December 10, 2024

TL;DR

This paper introduces NOMAD, a training paradigm for models built specifically for data synthesis. It targets a limitation of standard instruction-finetuned models, which are optimized for question answering and problem solving rather than data generation, and reports gains of over 4% on TriviaQA and over 2% on GSM8K with limited training data.

Contribution

The paper proposes NOMAD, an approach to training data generation models that differs from training a classical language model. It identifies two key factors: no-prompt-masked training and proper training set size selection.

Findings

1. NOMAD achieves over 4% improvement on TriviaQA
2. NOMAD achieves over 2% improvement on GSM8K
3. Interprets synthetic data through relevance and novelty lenses

Abstract

Recent advances in large language model (LLM) training have highlighted the need for diverse, high-quality instruction data. Recently, many works are exploring synthetic data generation using LLMs. However, they primarily focus on prompt engineering with standard supervised instruction-finetuned models, which contains a fundamental limitation: these models are optimized for general question-answering/problem-solving rather than data generation. We propose a paradigm shift named NOMAD by investigating how to specifically train models for data generation, demonstrating that this task differs significantly from training a classical LM. We identify two key factors: no-prompt-masked training and proper training set size selection. Our method, NOMAD, shows substantial improvements over baselines, achieving >4% gains in TriviaQA and >2% in GSM8K with limited training data. Finally,…
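To make the first of the two factors concrete, the sketch below contrasts prompt-masked and no-prompt-masked label construction for a single (prompt, response) pair. It is not taken from the paper. It assumes that "no-prompt-masked" means the training loss also covers the prompt tokens, and that masked positions use the ignore value -100, a common convention for cross-entropy losses. The helper names `build_labels` and `count_supervised` and the toy token ids are hypothetical.

```python
# Assumption: -100 marks positions that the loss should skip
# (a common convention, not something the abstract specifies).
IGNORE_INDEX = -100


def build_labels(prompt_ids, response_ids, mask_prompt):
    """Return (input_ids, labels) for one training example.

    mask_prompt=True  -> classical instruction tuning: loss on response only.
    mask_prompt=False -> no-prompt-masked (as assumed here): loss on every token.
    """
    input_ids = list(prompt_ids) + list(response_ids)
    if mask_prompt:
        labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    else:
        labels = list(input_ids)
    return input_ids, labels


def count_supervised(labels):
    """Number of positions that contribute to the loss."""
    return sum(1 for token in labels if token != IGNORE_INDEX)


# Hypothetical toy token ids: a 3-token prompt and a 2-token response.
prompt_ids = [11, 12, 13]
response_ids = [21, 22]

_, masked_labels = build_labels(prompt_ids, response_ids, mask_prompt=True)
_, unmasked_labels = build_labels(prompt_ids, response_ids, mask_prompt=False)

print(masked_labels, count_supervised(masked_labels))
print(unmasked_labels, count_supervised(unmasked_labels))
```

Under this reading, the unmasked variant also supervises the prompt tokens, which would matter for a model that is expected to generate new prompts rather than only answer them.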

Taxonomy

Topics: Educational Assessment and Improvement

Methods: Sparse Evolutionary Training · Focus