Synthetic Data for any Differentiable Target
Tristan Thrush, Sung Min Park, Herman Brunborg, Luke Bailey, Marcel Roed, Neil Band, Christopher Potts, Tatsunori Hashimoto

TL;DR
This paper introduces Dataset Policy Gradient (DPG), a reinforcement learning method that optimizes a synthetic data generator so that supervised fine-tuning on its examples steers a target language model toward a chosen differentiable metric.
Contribution
The paper presents DPG, a novel RL primitive that precisely optimizes synthetic data for targeted model behavior, enabling complex control without explicit prompts.
Findings
Targeted embedding of QR codes, patterns, and UUIDs in language models.
Ability to rephrase inputs in new languages using synthetic data.
Reduction of model's $\ell^2$ norm through synthetic fine-tuning.
Abstract
What are the limits of controlling language models via synthetic training data? We develop a reinforcement learning (RL) primitive, the Dataset Policy Gradient (DPG), which can precisely optimize synthetic data generators to produce a dataset of targeted examples. When used for supervised fine-tuning (SFT) of a target model, these examples cause the target model to do well on a differentiable metric of our choice. Our approach achieves this by taking exact data attribution via higher-order gradients and using those scores as policy gradient rewards. We prove that this procedure closely approximates the true, intractable gradient for the synthetic data generator. To illustrate the potential of DPG, we show that, using only SFT on generated examples, we can cause the target model's LM head weights to (1) embed a QR code, (2) embed the pattern , and (3) have lower …
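The loop described in the abstract (score each generated example by how much it moves a differentiable metric after fine-tuning, then use those scores as policy gradient rewards) can be illustrated on a toy problem. The following is a minimal sketch under loud assumptions, not the authors' implementation: the "target model" is a two-parameter linear regressor, "SFT" is a single SGD step, the "generator" is a softmax policy over three fixed candidate examples, and every function name and hyperparameter here is hypothetical. With a single SGD step the attribution score reduces to a dot product of gradients; differentiating through longer training runs would require the higher-order gradients the abstract mentions, which this sketch omits.

```python
import math
import random

def sgd_step(theta, batch, lr, weights=None):
    """One SGD step on the weighted sum of losses 0.5 * (theta.x - y)^2."""
    weights = weights or [1.0] * len(batch)
    grad = [0.0] * len(theta)
    for (x, y), w in zip(batch, weights):
        err = sum(t * xi for t, xi in zip(theta, x)) - y
        for j, xi in enumerate(x):
            grad[j] += w * err * xi
    return [t - lr * g for t, g in zip(theta, grad)]

def metric(theta, target):
    """Toy differentiable target: negative squared distance to `target`."""
    return -sum((t - s) ** 2 for t, s in zip(theta, target))

def attribution(theta, batch, lr, target):
    """Per-example score: d metric(theta_new) / d w_i, evaluated at w_i = 1."""
    theta_new = sgd_step(theta, batch, lr)
    d_metric = [-2.0 * (t - s) for t, s in zip(theta_new, target)]
    scores = []
    for x, y in batch:
        err = sum(t * xi for t, xi in zip(theta, x)) - y
        # d theta_new / d w_i = -lr * err * x, so the chain rule is a dot product.
        scores.append(sum(d * (-lr * err * xi) for d, xi in zip(d_metric, x)))
    return scores

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]

def train_generator(candidates, theta0, target, steps=300, batch_size=4,
                    lr_model=0.1, lr_policy=2.0, seed=0):
    """REINFORCE on a softmax policy over candidates, rewarded by attribution."""
    rng = random.Random(seed)
    logits = [0.0] * len(candidates)
    for _ in range(steps):
        probs = softmax(logits)
        idx = rng.choices(range(len(candidates)), weights=probs, k=batch_size)
        batch = [candidates[i] for i in idx]
        rewards = attribution(theta0, batch, lr_model, target)
        baseline = sum(rewards) / len(rewards)  # batch mean, for variance reduction
        for i, r in zip(idx, rewards):
            adv = r - baseline
            for k in range(len(logits)):
                # Gradient of log softmax probability of sample i w.r.t. logit k.
                grad_logp = (1.0 if k == i else 0.0) - probs[k]
                logits[k] += lr_policy * adv * grad_logp / batch_size
    return softmax(logits)

if __name__ == "__main__":
    # Hypothetical candidates (x, y): helpful, harmful, and off-target examples.
    candidates = [([1.0, 0.0], 1.0), ([1.0, 0.0], -1.0), ([0.0, 1.0], 1.0)]
    print(train_generator(candidates, theta0=[0.0, 0.0], target=[1.0, 0.0]))
```

In this toy setting the policy should shift its probability mass toward the first candidate, the only one whose SGD update moves the parameters toward the target. The batch-mean baseline is a design choice of the sketch, not something stated in the abstract.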
