# Data Cartography for Detecting Memorization Hotspots and Guiding Data Interventions in Generative Models

**Authors:** Laksh Patel, Neel Shanbhag

arXiv: 2509.00083 · 2025-09-03

## TL;DR

This paper introduces a data-centric framework called GenDataCarto that identifies and mitigates memorization hotspots in generative models, reducing data leakage with minimal impact on performance.

## Contribution

The paper presents a data cartography method that scores each pretraining sample for difficulty (early-epoch loss) and memorization (frequency of "forget events"), then uses those scores to guide targeted pruning and up-/down-weighting.

## Key findings

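- Reduces synthetic canary extraction success by over 40% when 10% of the training data is pruned.
- Increases validation perplexity by less than 0.5% under the same intervention.
- Proves that the memorization score lower-bounds classical influence under smoothness assumptions, and that down-weighting high-memorization hotspots decreases the generalization gap via uniform stability bounds.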

## Abstract

Modern generative models risk overfitting and unintentionally memorizing rare training examples, which can be extracted by adversaries or inflate benchmark performance. We propose Generative Data Cartography (GenDataCarto), a data-centric framework that assigns each pretraining sample a difficulty score (early-epoch loss) and a memorization score (frequency of "forget events"), then partitions examples into four quadrants to guide targeted pruning and up-/down-weighting. We prove that our memorization score lower-bounds classical influence under smoothness assumptions and that down-weighting high-memorization hotspots provably decreases the generalization gap via uniform stability bounds. Empirically, GenDataCarto reduces synthetic canary extraction success by over 40% at just 10% data pruning, while increasing validation perplexity by less than 0.5%. These results demonstrate that principled data interventions can dramatically mitigate leakage with minimal cost to generative performance.
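The scoring and partitioning steps described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and several details are assumptions because this summary does not give the paper's exact definitions: a "forget event" is taken to be a per-sample loss that rises above a threshold `tau` after having been at or below it; difficulty is averaged over the first `early_epochs` epochs; quadrant boundaries default to the medians of the two scores; and pruning simply drops the highest-memorization fraction. All function and parameter names are hypothetical.

```python
import numpy as np

def cartography_scores(losses, early_epochs=2, tau=1.0):
    """Score samples from a (n_samples, n_epochs) array of per-sample training losses."""
    losses = np.asarray(losses, dtype=float)
    # Difficulty: mean loss over the first few epochs (window size is an assumption).
    difficulty = losses[:, :early_epochs].mean(axis=1)
    # Assumed forget event: loss <= tau at epoch t, then loss > tau at epoch t + 1.
    learned = losses <= tau
    forget_events = learned[:, :-1] & ~learned[:, 1:]
    # Memorization: fraction of epoch-to-epoch transitions that are forget events.
    memorization = forget_events.sum(axis=1) / (losses.shape[1] - 1)
    return difficulty, memorization

def assign_quadrants(difficulty, memorization, d_thresh=None, m_thresh=None):
    """Label each sample 0..3; median thresholds are an assumed default."""
    d_thresh = np.median(difficulty) if d_thresh is None else d_thresh
    m_thresh = np.median(memorization) if m_thresh is None else m_thresh
    hard = difficulty > d_thresh
    memorized = memorization > m_thresh
    # 0: easy / low-mem, 1: hard / low-mem, 2: easy / high-mem, 3: hard / high-mem
    return hard.astype(int) + 2 * memorized.astype(int)

def prune_top_memorized(memorization, fraction=0.10):
    """Return a boolean keep-mask that drops the highest-memorization fraction."""
    memorization = np.asarray(memorization, dtype=float)
    n_drop = int(round(fraction * len(memorization)))
    order = np.argsort(-memorization, kind="stable")  # highest score first
    keep = np.ones(len(memorization), dtype=bool)
    keep[order[:n_drop]] = False
    return keep
```

In practice `losses` would be logged per sample at the end of each epoch. The keep-mask can be used to filter the dataset, or replaced with per-sample weights below 1 for the high-memorization quadrants if down-weighting is preferred over pruning.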

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2509.00083/full.md

## Figures

2 figures with captions in the complete paper: https://tomesphere.com/paper/2509.00083/full.md

## References

35 references — full list in the complete paper: https://tomesphere.com/paper/2509.00083/full.md

---
Source: https://tomesphere.com/paper/2509.00083