Extracting alignment data in open models

Federico Barbero; Xiangming Gu; Christopher A. Choquette-Choo; Chawin Sitawarin; Matthew Jagielski; Itay Yona; Petar Veličković; Ilia Shumailov; Jamie Hayes

arXiv:2510.18554 · cs.AI · October 27, 2025

TL;DR

This paper shows that substantial amounts of alignment training data can be extracted from post-trained models, and that embedding-based similarity measures this extraction more faithfully than string matching. The result has implications for model training and distillation practices.

Contribution

It measures the extraction of alignment data with embedding similarity rather than string matching, shows that models regurgitate training data from post-training phases, and shows that the extracted data can be used to retrain a model and recover performance.

Findings

1. Embedding models outperform string matching at measuring data extraction.
2. Models readily regurgitate training data from post-training phases.
3. Extracted data can be used to retrain models, recovering performance.

Abstract

In this work, we show that it is possible to extract significant amounts of alignment training data from a post-trained model, data that is useful to steer the model to improve certain capabilities such as long-context reasoning, safety, instruction following, and maths. While the majority of related work on memorisation has focused on measuring the success of training data extraction through string matching, we argue that embedding models are better suited for our specific goals. Distances measured through a high-quality embedding model can identify semantic similarities between strings that a different metric such as edit distance will struggle to capture. In fact, in our investigation, approximate string matching would have severely undercounted (by a conservative estimate of $10\times$) the amount of data that can be extracted, due to trivial artifacts that deflate the metric. Interestingly, we…
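The contrast the abstract draws between approximate string matching and embedding distance can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the abstract excerpt does not name the embedding model or the string metric the authors used, so `toy_embed` is a bag-of-words count vector standing in for a real neural embedding model, `difflib.SequenceMatcher` stands in for an edit-distance-style metric, and the two example strings are hypothetical.

```python
import math
import re
from collections import Counter
from difflib import SequenceMatcher


def string_similarity(a: str, b: str) -> float:
    """Approximate string-matching score in [0, 1] via difflib's ratio."""
    return SequenceMatcher(None, a, b).ratio()


def toy_embed(text: str) -> Counter:
    """ASSUMPTION: bag-of-words counts standing in for a neural embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[token] * v[token] for token in u.keys() & v.keys())
    norm_u = math.sqrt(sum(count * count for count in u.values()))
    norm_v = math.sqrt(sum(count * count for count in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical pair: a training sample and a generation that differs only in
# casing, punctuation, and markup (the kind of trivial artifact at issue).
training_sample = "Question: What is 12 times 7? Answer: 12 times 7 is 84."
generation = "**QUESTION**\n\nwhat is 12 times 7\n\n**ANSWER**\n\n12 times 7 is 84"

string_score = string_similarity(training_sample, generation)
embedding_score = cosine_similarity(toy_embed(training_sample), toy_embed(generation))

print(f"string matching: {string_score:.2f}")
print(f"toy embedding:   {embedding_score:.2f}")
```

On this pair the token-level vectors coincide, so the cosine score is 1 up to floating-point error, while the character-level score is pulled down by casing and markup alone. A real embedding model would additionally give credit to paraphrases, which this toy stand-in cannot do.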

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Software Testing and Debugging Techniques