Extracting alignment data in open models
Federico Barbero, Xiangming Gu, Christopher A. Choquette-Choo, Chawin Sitawarin, Matthew Jagielski, Itay Yona, Petar Veličković, Ilia Shumailov, Jamie Hayes

TL;DR
This paper demonstrates that significant alignment training data can be extracted from post-trained models using embedding-based similarity measures, revealing potential risks and implications for model training and distillation practices.
Contribution
It introduces a novel method for extracting alignment data via embeddings, showing that models regurgitate training data from post-training phases and that this data can be used to recover original performance.
Findings
Embedding models outperform string matching in data extraction.
Models readily regurgitate training data from post-training phases.
Extracted data can be used to retrain models, recovering performance.
Abstract
In this work, we show that it is possible to extract significant amounts of alignment training data from a post-trained model -- useful to steer the model to improve certain capabilities such as long-context reasoning, safety, instruction following, and maths. While the majority of related work on memorisation has focused on measuring success of training data extraction through string matching, we argue that embedding models are better suited for our specific goals. Distances measured through a high quality embedding model can identify semantic similarities between strings that a different metric such as edit distance will struggle to capture. In fact, in our investigation, approximate string matching would have severely undercounted (by a conservative estimate of ) the amount of data that can be extracted due to trivial artifacts that deflate the metric. Interestingly, we…
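To make the contrast between edit distance and embedding similarity concrete, here is a minimal sketch, not the authors' pipeline. It compares a normalised edit-distance score with a cosine score on one made-up pair of strings that differ only in formatting and sentence order. The function `toy_embed` is a hypothetical stand-in (lowercased word counts) for a real embedding model, and the example strings are invented for illustration; a learned embedding model would capture far more than word overlap.

```python
import math
import re
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            # deletion, insertion, substitution
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """Edit distance rescaled to [0, 1]; 1.0 means identical strings."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


def toy_embed(text: str) -> Counter:
    """ASSUMPTION: stand-in for an embedding model (lowercased word counts)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(u: Counter, v: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Invented example: same content, different markup, casing, and order.
reference = "Solve for x: 2x + 3 = 11. Show your working."
generated = "**Problem.**\n\nShow your working.   Solve for X:  2x + 3 = 11"

edit_score = edit_similarity(reference, generated)
embed_score = cosine_similarity(toy_embed(reference), toy_embed(generated))
print(f"edit similarity:   {edit_score:.2f}")
print(f"cosine similarity: {embed_score:.2f}")
```

In this toy pair the added markup and the swapped sentence order pull the edit-based score down, while the vector-based score stays high because the underlying content is unchanged. This is the kind of trivial artifact the abstract says can deflate string-matching metrics.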
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Software Testing and Debugging Techniques
