Synthetic Text Generation for Training Large Language Models via Gradient Matching

Dang Nguyen; Zeman Li; Mohammadhossein Bateni; Vahab Mirrokni; Meisam Razaviyayn; Baharan Mirzasoleiman

arXiv:2502.17607 · cs.LG · June 10, 2025

TL;DR

This paper introduces a theoretically grounded method for generating human-readable synthetic text that comes with convergence, performance, and privacy guarantees for fine-tuning large language models. Previous heuristic approaches could not produce readable text without compromising the privacy of real data and offered no performance guarantees.

Contribution

The authors propose a novel ADMM-based approach for synthetic text generation that provides formal guarantees for convergence, privacy, and model performance, unlike prior heuristic methods.

Findings

01 Effective synthetic text generation with convergence guarantees

02 Preserves privacy of original data during training

03 Improves fine-tuning performance on classification tasks

Abstract

Synthetic data has the potential to improve the performance, training efficiency, and privacy of real training examples. Nevertheless, existing approaches for synthetic text generation are mostly heuristic and cannot generate human-readable text without compromising the privacy of real data, nor do they provide performance guarantees for training Large Language Models (LLMs). In this work, we propose the first theoretically rigorous approach for generating synthetic human-readable text that provides convergence, performance, and privacy guarantees for fine-tuning LLMs on a target task. To do so, we leverage the Alternating Direction Method of Multipliers (ADMM), which iteratively optimizes the embeddings of synthetic examples to match the noisy gradient of the target training or validation data, and maps them to a sequence of text tokens with low perplexity. In doing so, the generated synthetic text…
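
The loop the abstract describes (match a noisy target gradient in embedding space, then map the embeddings back to tokens) can be illustrated with a small sketch. This is not the authors' algorithm. It is a generic scaled-form ADMM heuristic on a toy linear model, and every name in it (`w`, `X`, `V`, `noisy_target_gradient`, `admm_match`, the clipping and noise parameters) is a hypothetical stand-in. The low-perplexity constraint is omitted, and the noise scale is not calibrated to any privacy budget.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_real, vocab_size = 8, 32, 50

# Hypothetical toy stand-ins: a linear "model" w, real example embeddings X
# with targets y, and a vocabulary of token embeddings V. None of these come
# from the paper.
w = rng.normal(size=d)
X = rng.normal(size=(n_real, d))
y = rng.normal(size=n_real)
V = rng.normal(size=(vocab_size, d))


def noisy_target_gradient(clip=1.0, sigma=0.5):
    """Clip per-example gradients of 0.5*(w@x - y)**2 w.r.t. w, average them,
    and add Gaussian noise (noise scale is illustrative, not calibrated)."""
    g = (X @ w - y)[:, None] * X
    norms = np.linalg.norm(g, axis=1, keepdims=True)
    g = g * np.minimum(1.0, clip / np.maximum(norms, 1e-12))
    return g.mean(axis=0) + rng.normal(scale=sigma * clip / n_real, size=d)


def match_loss(x, target, label):
    """Half squared distance between the synthetic gradient and the target."""
    diff = (x @ w - label) * x - target
    return 0.5 * diff @ diff


def match_grad(x, target, label):
    """Analytic gradient of match_loss with respect to the embedding x."""
    r = x @ w - label
    diff = r * x - target
    return r * diff + w * (x @ diff)


def project_to_vocab(v):
    """Nearest vocabulary embedding (the discrete token constraint)."""
    idx = int(np.argmin(np.linalg.norm(V - v, axis=1)))
    return idx, V[idx]


def admm_match(target, label=1.0, rho=1.0, outer=30, inner=20):
    """Scaled-form ADMM: x is the free embedding, z its token-constrained copy."""
    x = rng.normal(size=d)
    idx, z = project_to_vocab(x)
    u = np.zeros(d)
    for _ in range(outer):
        # x-update: approximately minimise the augmented objective with
        # gradient descent and a backtracking line search.
        def f(v):
            penalty = 0.5 * rho * np.sum((v - z + u) ** 2)
            return match_loss(v, target, label) + penalty

        for _ in range(inner):
            g, step = match_grad(x, target, label) + rho * (x - z + u), 1.0
            while step > 1e-8 and f(x - step * g) > f(x):
                step *= 0.5
            if step > 1e-8:
                x = x - step * g
        # z-update: project onto the vocabulary. u-update: scaled dual step.
        idx, z = project_to_vocab(x + u)
        u = u + x - z
    return idx, x, z


target = noisy_target_gradient()
idx, x, z = admm_match(target)
print("chosen token id:", idx)
print("matching loss at projected embedding:", match_loss(z, target, 1.0))
```

Splitting the variable into a continuous copy `x` and a discrete copy `z` keeps the gradient-matching step differentiable while still returning a real token at the end. Because the vocabulary constraint is nonconvex, this toy version carries no convergence guarantee of its own.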

Videos

Synthetic Text Generation for Training Large Language Models via Gradient Matching · SlidesLive

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques