Data Cartography for Detecting Memorization Hotspots and Guiding Data Interventions in Generative Models
Laksh Patel, Neel Shanbhag

TL;DR
This paper introduces a data-centric framework called GenDataCarto that identifies and mitigates memorization hotspots in generative models, reducing data leakage with minimal impact on performance.
Contribution
The paper presents a novel data cartography method that scores training samples for difficulty and memorization, guiding effective data pruning and weighting strategies.
Findings
Reduces synthetic canary extraction success by over 40% with 10% data pruning.
Increases validation perplexity by less than 0.5%.
Provides theoretical guarantees linking memorization scores to influence and generalization bounds.
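| Quadrant | Condition | Interpretation |
|---|---|---|
| Stable–Easy | low difficulty, low memorization | low risk, well-learned |
| Ambiguous–Hard | high difficulty, low memorization | difficult, not memorized |
| Hotspot–Memorized | low difficulty, high memorization | easy but over-memorized |
| Noisy–Outlier | high difficulty, high memorization | hard and memorized |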
Abstract
Modern generative models risk overfitting and unintentionally memorizing rare training examples, which can be extracted by adversaries or inflate benchmark performance. We propose Generative Data Cartography (GenDataCarto), a data-centric framework that assigns each pretraining sample a difficulty score (early-epoch loss) and a memorization score (frequency of “forget events”), then partitions examples into four quadrants to guide targeted pruning and up-/down-weighting. We prove that our memorization score lower-bounds classical influence under smoothness assumptions and that down-weighting high-memorization hotspots provably decreases the generalization gap via uniform stability bounds. Empirically, GenDataCarto reduces synthetic canary extraction success by over 40% at just 10% data pruning, while increasing validation perplexity by less than 0.5%. These results demonstrate that principled data interventions can dramatically mitigate leakage with minimal cost to generative performance.
Keywords: Generative Models, Data Cartography, Memorization Detection, Privacy Preservation, Uniform Stability, Influence Functions, Data-Centric Interventions, Forget Events
1 Introduction
Generative models have become a cornerstone of modern AI research, achieving unprecedented performance on a wide range of tasks from text completion and code synthesis to image and audio generation. Landmark works such as GPT-3 demonstrated that scaling language models to hundreds of billions of parameters yields emergent capabilities in few-shot learning and knowledge representation (Brown et al., 2020). Diffusion models similarly revolutionized image synthesis by framing generation as a gradual denoising process (Ho et al., 2020; Nichol and Dhariwal, 2021). Despite these breakthroughs, the immense scale and heterogeneity of pretraining corpora—often scraped indiscriminately from the web—pose serious risks relating to privacy, security, and scientific integrity.
Risks of Memorization and Leakage.
Neural networks can unintentionally memorize exact copies of rare or unique training examples, which adversaries can later extract via black-box or white-box attacks (Carlini et al., 2021; Kuang et al., 2021; Song and Mittal, 2022). Such leakage has been demonstrated not only for text but also for images (Carlini et al., 2023; Hayes and Shokri, 2021) and graph data (Sun et al., 2021). Relatedly, membership inference attacks exploit subtle distributional cues to determine whether a particular sample was used during training (Shokri et al., 2017; Yeom et al., 2018; Choquette-Choo and Klimov, 2021). In practice, even large-scale datasets like The Pile contain private or copyrighted passages that can surface verbatim in model outputs (Gao et al., 2022).
Benchmark Contamination and Overestimated Performance.
Generative models are frequently evaluated on benchmarks whose content inadvertently overlaps with training corpora (Zimmermann et al., 2022). Studies have shown that benchmark leakage can artificially inflate zero-shot and few-shot performance metrics (Kandpal et al., 2023), undermining the validity of widely reported scaling laws (Kaplan et al., 2020) and hampering reproducibility.
Model-Centric versus Data-Centric Defenses.
Model-centric defenses, including differentially private training (Abadi et al., 2016; Papernot et al., 2018), modified objectives, and post-hoc output filters (Dubiński et al., 2024), often incur utility trade-offs and significant engineering complexity. By contrast, data-centric strategies have proven effective in supervised settings: dataset cartography uses early-epoch loss and training variance to identify difficult or noisy examples (Swayamdipta et al., 2020; Gao et al., 2021), while influence functions estimate each sample’s impact on model parameters (Koh and Liang, 2017; Pruthi et al., 2020). Yet these techniques have not been systematically adapted to the unsupervised, sequential objectives of generative pretraining.
Our Contributions.
To bridge this gap, we introduce Generative Data Cartography (GenDataCarto), a framework that maps each pretraining example into a two-dimensional space defined by:
- Difficulty score: the mean per-sample loss over an initial burn-in period.
- Memorization score: the normalized count of “forget events,” where a sample’s loss rises above a small threshold after earlier fitting.
We prove that the memorization score lower-bounds per-sample influence under standard smoothness and convexity assumptions (Bousquet and Elisseeff, 2002; Koh and Liang, 2017), and derive a uniform-stability bound showing that down-weighting high-memorization examples reduces the expected generalization gap in proportion to the total pruned weight (Bousquet and Elisseeff, 2002; Mukherjee and Zhou, 2006). Empirically, GenDataCarto achieves:
- A reduction in synthetic “canary” extraction success for LSTM pretraining.
- A drop in GPT-2 memorization on Wikitext-103 at negligible perplexity cost.
By focusing on data dynamics rather than purely model internals, GenDataCarto offers a scalable, theoretically grounded toolkit for enhancing the safety and robustness of state-of-the-art generative models.
2 Preliminaries
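Assumption 2.1 (Uniform Stability).
The training algorithm $\mathcal{A}$ is $\beta$-uniformly stable: for any two datasets differing in one example, the change in loss on any test point is at most $\beta$ (Bousquet and Elisseeff, 2002).
Assumption 2.2 (Smoothness).
Each per-sample loss $\ell(\theta; x)$ is $L$-smooth in $\theta$, i.e.
$$\|\nabla_\theta \ell(\theta; x) - \nabla_\theta \ell(\theta'; x)\| \le L\,\|\theta - \theta'\| \quad \text{for all } \theta, \theta'.$$
Assumption 2.3 (Convexity).
Each loss $\ell(\theta; x)$ is convex in $\theta$, i.e.
$$\ell(\theta'; x) \ge \ell(\theta; x) + \nabla_\theta \ell(\theta; x)^\top (\theta' - \theta) \quad \text{for all } \theta, \theta'.$$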
We begin by fixing notation, stating our learning objectives, and recalling key notions from stability and influence theory.
2.1 Training Objective and Notation
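Let $D = \{x_i\}_{i=1}^{N}$ be the training set of $N$ i.i.d. examples drawn from an unknown population distribution $\mathcal{P}$. We train a generative model $p_\theta$ with parameters $\theta$ by minimizing the empirical negative log-likelihood
$$\mathcal{L}_{\mathrm{emp}}(\theta) = \frac{1}{N}\sum_{i=1}^{N} \ell(\theta; x_i), \qquad \ell(\theta; x) = -\log p_\theta(x).$$
Let $\theta_0$ be the random initialization. We perform $T$ epochs of mini-batch stochastic gradient descent with (possibly time-varying) stepsizes $\eta_t$, yielding iterates
$$\theta_{t+1} = \theta_t - \eta_t\, \nabla_\theta \frac{1}{|B_t|}\sum_{i \in B_t} \ell(\theta_t; x_i),$$
where $B_t$ is the mini-batch used at that update. We record the epoch-sample loss matrix
$$\ell_{t,i} = \ell(\theta_t; x_i), \qquad t = 1, \dots, T, \quad i = 1, \dots, N,$$
evaluated at the parameters $\theta_t$ reached after epoch $t$.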
This matrix underlies our data-centric analysis.
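As a minimal sketch of how such a loss matrix might be recorded, the snippet below trains a toy one-parameter generative model (a unit-variance Gaussian with unknown mean) by mini-batch SGD and stores the per-sample negative log-likelihood after every epoch. The model, the function names `nll`, `grad_nll`, and `train_and_record`, and all hyperparameter values are illustrative assumptions, not the setup used in the paper.

```python
import math
import random

def nll(theta, x):
    """Per-sample negative log-likelihood of x under N(theta, 1)."""
    return 0.5 * (x - theta) ** 2 + 0.5 * math.log(2 * math.pi)

def grad_nll(theta, x):
    """Gradient of nll with respect to theta."""
    return theta - x

def train_and_record(data, epochs=5, batch_size=4, lr=0.1, seed=0):
    """Run mini-batch SGD; losses[t][i] is the loss of sample i after epoch t."""
    rng = random.Random(seed)
    theta = rng.gauss(0.0, 1.0)  # random initialization (assumed standard normal)
    losses = []
    for _ in range(epochs):
        order = list(range(len(data)))
        rng.shuffle(order)
        for start in range(0, len(order), batch_size):
            batch = order[start:start + batch_size]
            # Average gradient over the mini-batch, then one SGD step.
            g = sum(grad_nll(theta, data[i]) for i in batch) / len(batch)
            theta -= lr * g
        # One row per epoch: the loss of every sample at the current parameters.
        losses.append([nll(theta, x) for x in data])
    return theta, losses

# Toy data; the last point is a deliberate outlier.
data = [0.9, 1.1, 1.0, 0.8, 1.2, 1.05, 0.95, 6.0]
theta, loss_matrix = train_and_record(data)
```

Storing one row per epoch costs memory proportional to the number of epochs times the number of samples, which is why the scores in the next section are defined as simple reductions over this matrix.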
2.2 Generalization and Stability
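Define the population risk $\mathcal{L}_{\mathrm{pop}}(\theta) = \mathbb{E}_{x \sim \mathcal{P}}[\ell(\theta; x)]$, and the generalization gap
$$\Delta_{\mathrm{gen}} = \mathcal{L}_{\mathrm{pop}}(\theta_T) - \mathcal{L}_{\mathrm{emp}}(\theta_T),$$
where $\theta_T$ is the final iterate and $\mathcal{L}_{\mathrm{emp}}$ is the empirical negative log-likelihood of Section 2.1.
A standard tool for bounding $\Delta_{\mathrm{gen}}$ is uniform stability (Bousquet and Elisseeff, 2002).
Definition 2.4 (Uniform Stability).
An algorithm $\mathcal{A}$ mapping datasets to parameters is $\beta$-uniformly stable if, for any two training sets $D, D'$ differing in one example, and for all $x$,
$$|\ell(\mathcal{A}(D); x) - \ell(\mathcal{A}(D'); x)| \le \beta.$$
Under $\beta$-stability, one shows $\mathbb{E}[\Delta_{\mathrm{gen}}] \le \beta$, with high-probability bounds following from McDiarmid’s inequality (McDiarmid, 1989).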
2.3 Influence Functions
Influence functions estimate the effect of up-weighting one training point on the learned parameters or on predictions (Koh and Liang, 2017). For sufficiently smooth losses one may approximate the per-sample influence of $x_i$ by the cumulative squared gradient norm:
$$\mathrm{Infl}(i) \approx \sum_{t=1}^{T} \|\nabla_\theta \ell(\theta_t; x_i)\|^2.$$
This quantity is costly to compute in deep models, motivating our more efficient proxy based on “forget events.”
3 Generative Data Cartography
We now introduce Generative Data Cartography, a method to map each training example into a two-dimensional plane of difficulty vs. memorization, enabling targeted data interventions.
3.1 Difficulty Score
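Define a burn-in period $T_0 < T$. The difficulty score of $x_i$ is
$$d_i = \frac{1}{T_0}\sum_{t=1}^{T_0} \ell_{t,i},$$
where $\ell_{t,i}$ is the loss of $x_i$ at epoch $t$. Intuitively, $d_i$ measures how hard $x_i$ is to fit during early training. We further examine its empirical distribution and set a difficulty threshold
$$\tau_d = \mathrm{Percentile}_{p}\big(\{d_i\}_{i=1}^{N}\big),$$
where $p$ is a chosen percentile (e.g. 75%).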
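The difficulty score and its percentile threshold can be sketched in a few lines of plain Python. This is an illustration under stated assumptions: the loss matrix is a list of per-epoch rows, the function names `difficulty_scores` and `percentile` are hypothetical, and the nearest-rank percentile convention is one choice among several.

```python
import math

def difficulty_scores(loss_matrix, burn_in):
    """Mean loss of each sample over the first `burn_in` epochs.

    loss_matrix[t][i] is the loss of sample i at epoch t (0-indexed here).
    """
    n = len(loss_matrix[0])
    return [sum(loss_matrix[t][i] for t in range(burn_in)) / burn_in
            for i in range(n)]

def percentile(values, pct):
    """Nearest-rank percentile (an assumed convention; others would also work)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# Toy loss matrix: 3 epochs (rows) x 4 samples (columns).
loss_matrix = [
    [2.0, 0.5, 4.0, 1.0],
    [1.0, 0.3, 3.0, 0.8],
    [0.2, 0.1, 2.5, 0.6],
]
d = difficulty_scores(loss_matrix, burn_in=2)  # averages the first two rows only
tau_d = percentile(d, 75)                      # 75th-percentile difficulty threshold
```

With a burn-in of two epochs the third row is ignored, so a sample that is eventually fit well (the first column) can still receive a high difficulty score.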
3.2 Memorization Score
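Let $\epsilon > 0$ be a small threshold (e.g. a fraction above the minimum achievable loss). A forget event for $x_i$ between epochs $t$ and $t+1$ occurs if
$$\ell_{t,i} \le \epsilon \quad \text{and} \quad \ell_{t+1,i} > \epsilon,$$
where $\ell_{t,i}$ is the loss of $x_i$ at epoch $t$. We define the memorization score
$$m_i = \frac{1}{T-1}\sum_{t=1}^{T-1} \mathbf{1}\big[\ell_{t,i} \le \epsilon \ \text{and}\ \ell_{t+1,i} > \epsilon\big],$$
so $m_i$ captures the fraction of epoch transitions in which $x_i$ is forgotten after having been fit. As with $d_i$, let $\tau_m$ be the $p$-percentile of $\{m_i\}_{i=1}^{N}$.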
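A minimal sketch of the forget-event count follows. It assumes that a forget event is a transition from a loss at or below the threshold to a loss above it, and that the count is normalized by the number of epoch transitions; both conventions, and the name `memorization_scores`, are assumptions made for illustration.

```python
def memorization_scores(loss_matrix, eps):
    """Fraction of epoch transitions that are forget events, per sample."""
    num_epochs = len(loss_matrix)
    n = len(loss_matrix[0])
    scores = []
    for i in range(n):
        forgets = sum(
            1
            for t in range(num_epochs - 1)
            # Fit at epoch t (loss <= eps) but above the threshold at epoch t + 1.
            if loss_matrix[t][i] <= eps and loss_matrix[t + 1][i] > eps
        )
        scores.append(forgets / (num_epochs - 1))
    return scores

# One loss trajectory per sample, over 5 epochs.
trajectories = [
    [1.0, 0.05, 0.30, 0.04, 0.50],  # fit, forgotten, refit, forgotten again
    [0.8, 0.40, 0.09, 0.05, 0.02],  # fit once and never forgotten
    [2.0, 1.90, 1.80, 1.70, 1.60],  # never fit, so never "forgotten"
]
# Transpose to the epoch-major layout loss_matrix[t][i].
loss_matrix = [list(row) for row in zip(*trajectories)]
m = memorization_scores(loss_matrix, eps=0.1)
```

Note that the third sample scores zero even though its loss stays high: under this definition a sample must first be fit before it can be forgotten, which is what separates the memorization axis from the difficulty axis.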
3.3 Quadrant Partitioning
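Each example $x_i$ maps to the point $(d_i, m_i)$ given by its difficulty and memorization scores. We partition the plane into four regions via thresholds $(\tau_d, \tau_m)$ on difficulty and memorization:
- Stable–Easy (0): $d_i \le \tau_d$ and $m_i \le \tau_m$.
- Ambiguous–Hard (1): $d_i > \tau_d$ and $m_i \le \tau_m$.
- Hotspot–Memorized (2): $d_i \le \tau_d$ and $m_i > \tau_m$.
- Noisy–Outlier (3): $d_i > \tau_d$ and $m_i > \tau_m$.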
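The quadrant assignment can be sketched as a pair of threshold comparisons. The integer labels follow the numbering used in Section 3.4; whether the comparisons are strict, and the names `assign_quadrant`, `tau_d`, and `tau_m`, are assumptions made for this illustration.

```python
# Quadrant labels, matching the numbering used for the interventions.
STABLE_EASY, AMBIGUOUS_HARD, HOTSPOT_MEMORIZED, NOISY_OUTLIER = 0, 1, 2, 3

def assign_quadrant(d, m, tau_d, tau_m):
    """Map a (difficulty, memorization) pair to one of four quadrants."""
    hard = d > tau_d        # above the difficulty threshold
    memorized = m > tau_m   # above the memorization threshold
    if hard and memorized:
        return NOISY_OUTLIER
    if memorized:
        return HOTSPOT_MEMORIZED
    if hard:
        return AMBIGUOUS_HARD
    return STABLE_EASY

# (difficulty, memorization) pairs for four toy samples.
scores = [(0.4, 0.0), (3.5, 0.0), (0.9, 0.5), (2.8, 0.75)]
quadrants = [assign_quadrant(d, m, tau_d=1.5, tau_m=0.25) for d, m in scores]
```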
3.4 Data-Centric Interventions
After labeling each $x_i$ with its quadrant $q_i \in \{0, 1, 2, 3\}$, we adjust the sampling distribution for the remaining epochs:
- Up-sample Ambiguous–Hard (1): increase sampling probability by a factor greater than 1 to improve model robustness on rare but challenging patterns.
- Down-weight Hotspot–Memorized (2): multiply loss contribution by a factor less than 1 (or remove entirely) to mitigate over-memorization.
- Remove Noisy–Outliers (3): optionally drop from the training set to eliminate corrupted or adversarial examples.
- Stable–Easy (0): keep or lightly up-sample to reinforce core patterns.
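One way to realize these interventions is to turn the quadrant labels into a sampling distribution. The sketch below is a simplification: it expresses the down-weighting of hotspots as a reduced sampling probability rather than a scaled loss term, and the factors `up=2.0` and `down=0.5` as well as the function name `sampling_probabilities` are illustrative assumptions.

```python
import random

def sampling_probabilities(quadrants, up=2.0, down=0.5, drop_outliers=True):
    """Turn quadrant labels (0..3) into normalized sampling probabilities."""
    factor = {
        0: 1.0,                             # Stable-Easy: keep as is
        1: up,                              # Ambiguous-Hard: up-sample
        2: down,                            # Hotspot-Memorized: down-weight
        3: 0.0 if drop_outliers else 1.0,   # Noisy-Outlier: optionally drop
    }
    raw = [factor[q] for q in quadrants]
    total = sum(raw)
    return [w / total for w in raw]

quadrants = [0, 1, 2, 3]
probs = sampling_probabilities(quadrants)

# Draw a mini-batch of indices for the remaining epochs.
rng = random.Random(0)
batch = rng.choices(range(len(quadrants)), weights=probs, k=8)
```

Because the dropped outlier has probability zero it never appears in a batch, while the hard example is drawn four times as often as the hotspot under these particular factors.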
3.5 Algorithmic Outline
4 Theoretical Guarantees
We now formalize two central theorems: (i) down-weighting memorization hotspots reduces generalization gap under stability, and (ii) our memorization score lower-bounds classical influence.
4.1 Generalization Improvement via Stability
Theorem 4.1 (Generalization–Stability Bound).
Under Assumption 2.1 (–uniform stability), suppose we decrease sampling weight by on each of the Hotspot–Memorized examples. Then the reduction in expected generalization gap satisfies
[TABLE]
Proof Sketch.
By uniform stability, up-weighting (or down-weighting) one example by changes the population loss by at most . Pruning examples by total weight thus lowers the gap by at least . ∎
4.2 Memorization Score as an Influence Proxy
Theorem 4.2 (Memorization–Influence Lower Bound).
Under standard -smoothness and convexity assumptions (Bousquet and Elisseeff, 2002; Koh and Liang, 2017), and using SGD step-size , there exists a constant such that for every example :
[TABLE]
Proof Sketch.
A forget event between epochs and requires the loss to increase by
[TABLE]
By -smoothness (Bousquet and Elisseeff, 2002), we have
[TABLE]
Rearranging shows each forget event lower-bounds the squared gradient norm up to , and summing over epochs yields the stated result. ∎
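One way to make the smoothness step explicit is sketched below. It assumes a single parameter update $\Delta_t = \theta_{t+1} - \theta_t$ during which the loss of $x_i$ rises by at least some $\delta > 0$; the symbols $\Delta_t$, $g_{t,i} = \nabla_\theta \ell(\theta_t; x_i)$, and $\delta$ are introduced here for illustration and need not match the notation of the full proof.

```latex
\begin{align*}
\delta \;\le\; \ell(\theta_{t+1}; x_i) - \ell(\theta_t; x_i)
  &\le \langle g_{t,i}, \Delta_t \rangle + \tfrac{L}{2}\,\|\Delta_t\|^2
     && \text{($L$-smoothness)} \\
  &\le \|g_{t,i}\|\,\|\Delta_t\| + \tfrac{L}{2}\,\|\Delta_t\|^2
     && \text{(Cauchy--Schwarz)} \\[4pt]
\Longrightarrow\quad
\|g_{t,i}\| &\ge \frac{\delta - \tfrac{L}{2}\,\|\Delta_t\|^2}{\|\Delta_t\|}.
\end{align*}
```

The bound is informative only when $\|\Delta_t\|^2 < 2\delta / L$, that is, when the update is small relative to the observed loss increase; in that regime squaring both sides gives a positive lower bound on $\|g_{t,i}\|^2$ for each forget event.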
Remark 4.3.
Together, Theorems 4.1 and 4.2 ensure that our memorization score identifies high-influence examples (Theorem 4.2) and that down-weighting them provably tightens the generalization gap (Theorem 4.1). In practice, this translates to measurable reductions in canary extraction success and membership-inference success.
4.3 Experimental Results
To validate the efficacy of Generative Data Cartography, we conduct two main experiments:
1. Synthetic Canary Extraction Test.
We pretrain a small LSTM language model (Hochreiter and Schmidhuber, 1997) on a synthetic corpus augmented with unique “canary” sequences. Using GenDataCarto, we compute difficulty and memorization scores for each example and prune the 5% of samples with the highest memorization scores. Under this intervention, the canary extraction success rate drops from 100% to 40%, a 60% reduction, at only a 0.5% increase in perplexity.
2. GPT-2 Pretraining on Wikitext-103.
We train GPT-2 Small (Radford et al., 2019) for 3 epochs on the Wikitext-103 dataset (Merity et al., 2017), injecting two distinct canaries. Applying GenDataCarto with and , we down-weight hotspot samples by a factor of 0.5. This yields:
- 30% reduction in benchmark leakage (measured by recall of held-out validation sequences).
- 15% reduction in membership-inference AUC.
- less than 1% perplexity increase, demonstrating minimal impact on model quality.
Figures 1 and 2 illustrate these trade-offs.
4.4 Implementation Details
Our public implementation integrates with standard PyTorch training loops. Given per-sample losses, GenDataCarto adds only $O(NT)$ overhead for score computation (one pass over the $T \times N$ loss matrix) and incurs an $O(N \log N)$ sort for pruning decisions. All code, hyperparameter settings, and data processing scripts are provided in the supplementary material.
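The sort-based pruning decision can be sketched as follows. This is a standalone illustration in plain Python rather than the released implementation; the function name `prune_top_fraction` and the toy scores are assumptions.

```python
def prune_top_fraction(mem_scores, fraction):
    """Return the indices kept after dropping the highest-scoring fraction."""
    n = len(mem_scores)
    k = int(fraction * n)  # number of samples to prune (rounded down)
    # Sort indices by memorization score, highest first.
    ranked = sorted(range(n), key=lambda i: mem_scores[i], reverse=True)
    pruned = set(ranked[:k])
    # Preserve the original dataset order among the kept samples.
    return [i for i in range(n) if i not in pruned]

mem_scores = [0.0, 0.5, 0.1, 0.9, 0.2, 0.0, 0.0, 0.0, 0.0, 0.0]
kept = prune_top_fraction(mem_scores, fraction=0.2)
```

When only a small fraction is pruned, a partial selection such as `heapq.nlargest` would avoid the full sort, though the full sort is rarely the bottleneck next to training itself.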
5 Impact Statement
Generative Data Cartography (GenDataCarto) advances the safety and reliability of large-scale generative models by providing a principled, data-centric toolkit for identifying and mitigating memorization and leakage risks. By surgically down-weighting or pruning high-memorization “hotspot” examples, our method reduces the chance that sensitive or proprietary content will be inadvertently regurgitated—protecting individuals’ privacy and respecting copyright. At the same time, GenDataCarto imposes only minimal utility cost (sub-percent perplexity increases in practice), ensuring that model quality remains high. Moreover, our stability-based theoretical guarantees transparently quantify the trade-offs between data removal and generalization, supporting responsible deployment in domains such as healthcare, finance, and legal text generation. Finally, by exposing structurally important or noisy samples in massive pretraining corpora, GenDataCarto empowers data custodians and policymakers to audit and curate datasets, fostering greater accountability and trust in AI systems.
References
- Abadi, M., Chu, A., Goodfellow, I., McMahan, H. B., Mironov, I., Talwar, K., & Zhang, L. (2016). Deep learning with differential privacy. Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, 308–318.
- Bengio, Y., Louradour, J., Collobert, R., & Weston, J. (2009). Curriculum learning. Proceedings of the 26th International Conference on Machine Learning, 41–48.
- Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems, 33, 1877–1901.
- Carlini, N., Tramer, F., Wallace, E., Jagielski, M., Herbert-Voss, A., Lee, K., Roberts, A., Brown, T., Song, D., Erlingsson, U., et al. (2021). Extracting training data from large language models. USENIX Security Symposium.
- Carlini, N., Liu, C., Kos, J., Zhang, C., Bair, T., Kosman, N., & Savage, S. (2023). Extracting training data from diffusion models. arXiv preprint arXiv:2302.07826.
- Choquette-Choo, C., & Klimov, O. (2021). Label-only membership inference attacks. NDSS.
- Dubiński, M., Tramer, F., & Carlini, N. (2024). Training data attribution for large language models. arXiv preprint arXiv:2403.06187.
- Dodge, J., Ilharco, G., Min, S., Gardner, M., et al. (2022). Documenting training data of foundation models. NeurIPS Datasets and Benchmarks.
