PerProb: Indirectly Evaluating Memorization in Large Language Models
Yihan Liao, Jacky Keung, Xiaoxue Ma, Jingyu Zhang, Yicheng Sun

TL;DR
PerProb introduces a standardized, label-free framework that indirectly assesses memorization and privacy risks in large language models by analyzing perplexity differences; the framework applies across various models and settings.
Contribution
We propose PerProb, a novel, unified method for evaluating LLM memorization that relies on neither membership labels nor internal model access, addressing limitations of prior membership inference attacks (MIAs).
Findings
PerProb effectively detects memorization across multiple datasets.
Mitigation strategies like differential privacy reduce data leakage.
Memory behaviors vary significantly among different LLMs.
Abstract
The rapid advancement of Large Language Models (LLMs) has been driven by extensive datasets that may contain sensitive information, raising serious privacy concerns. One notable threat is the Membership Inference Attack (MIA), where adversaries infer whether a specific sample was used in model training. However, the true impact of MIA on LLMs remains unclear due to inconsistent findings and the lack of standardized evaluation methods, further complicated by the undisclosed nature of many LLM training sets. To address these limitations, we propose PerProb, a unified, label-free framework for indirectly assessing LLM memorization vulnerabilities. PerProb evaluates changes in perplexity and average log probability between data generated by victim and adversary models, enabling an indirect estimation of training-induced memory. Compared with prior MIA methods that rely on member/non-member…
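The abstract frames PerProb in terms of perplexity and average log probability. Purely as an illustration of those two metrics, the sketch below computes them from per-token natural-log probabilities and compares two sets of generated texts. It assumes the token log-probabilities were already obtained by scoring each text with a single language model. The function names, the per-text averaging, and the sign convention are hypothetical choices made for this example; they are not PerProb's published procedure, which the truncated abstract does not specify.

```python
import math
from statistics import fmean
from typing import Dict, List, Sequence


def avg_log_prob(token_logprobs: Sequence[float]) -> float:
    """Mean natural-log probability per token of one scored text."""
    if len(token_logprobs) == 0:
        raise ValueError("a text must contain at least one scored token")
    return fmean(token_logprobs)


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Standard perplexity: exp(-(1/N) * sum of token log-probabilities)."""
    return math.exp(-avg_log_prob(token_logprobs))


def perplexity_gap(victim_texts: List[Sequence[float]],
                   adversary_texts: List[Sequence[float]]) -> Dict[str, float]:
    """Compare mean per-text metrics of victim- vs adversary-generated texts.

    Hypothetical sign convention (an assumption, not from the paper):
    positive values mean the victim-generated texts receive lower
    perplexity and higher average log probability under the scoring model.
    """
    victim_ppl = fmean(perplexity(t) for t in victim_texts)
    adv_ppl = fmean(perplexity(t) for t in adversary_texts)
    victim_alp = fmean(avg_log_prob(t) for t in victim_texts)
    adv_alp = fmean(avg_log_prob(t) for t in adversary_texts)
    return {
        "delta_perplexity": adv_ppl - victim_ppl,
        "delta_avg_log_prob": victim_alp - adv_alp,
    }


if __name__ == "__main__":
    # Toy, made-up token log-probabilities; each inner list is one text.
    victim = [[-1.0, -1.0], [-0.5, -1.5]]
    adversary = [[-2.0, -2.0]]
    print(perplexity_gap(victim, adversary))
```

Averaging perplexity per text (rather than pooling all tokens) keeps each generated sample equally weighted; a token-pooled average is an equally plausible alternative, and which one PerProb uses is not stated in this excerpt.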
