Provably Protecting Fine-Tuned LLMs from Training Data Extraction while Preserving Utility
Tom Segal, Asaf Shabtai, Yuval Elovici

TL;DR
This paper introduces SCP-Δr, a method that protects fine-tuned LLMs against training data extraction attacks while maintaining high utility. It preserves the few influential token-level probability shifts induced by fine-tuning and smooths the remaining low-impact tokens using a base model.
Contribution
It proposes SCP-Δr, a new Near Access Freeness (NAF)-based algorithm that offers stronger theoretical privacy guarantees and empirical protection with minimal utility degradation.
Findings
SCP-Δr significantly reduces data extraction risks.
The method maintains model utility with minimal performance loss.
Its theoretical privacy bounds are orders of magnitude better than those of existing NAF-based methods.
Abstract
Fine-tuning large language models (LLMs) on sensitive datasets raises privacy concerns, as training data extraction (TDE) attacks can expose highly confidential information. Existing defenses against such attacks either lack formal privacy guarantees or incur substantial utility degradation. We observe that fine-tuning induces widespread probability shifts, yet preserving only a small subset of influential token-level deviations is sufficient; the remaining shifts can be aggressively smoothed with minimal impact on utility. Motivated by this insight, we propose SCP-Δr, a Near Access Freeness (NAF)-based algorithm that operates on relative probabilities and explicitly smooths low-impact tokens using a base model. SCP-Δr achieves orders-of-magnitude better theoretical bounds than existing NAF-based methods and provides strong empirical protection against TDE attacks with…
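The smoothing idea in the abstract can be illustrated with a toy next-token distribution. The sketch below is not the SCP-Δr algorithm, whose details are not given on this page; it is a minimal illustration under stated assumptions. The function name `smooth_low_impact_tokens`, the use of the absolute log-ratio as the "relative shift", and the fixed `keep` count are all hypothetical choices made for this example.

```python
import numpy as np

def smooth_low_impact_tokens(p_base, p_ft, keep=2):
    """Toy illustration (hypothetical, not the paper's method).

    Keeps the fine-tuned probability only for the `keep` tokens whose
    relative shift |log(p_ft / p_base)| is largest, reverts every other
    token to its base-model probability, then renormalizes.
    """
    p_base = np.asarray(p_base, dtype=float)
    p_ft = np.asarray(p_ft, dtype=float)
    # Assumed impact score: absolute log-ratio between the two models.
    shift = np.abs(np.log(p_ft / p_base))
    # Indices of the `keep` tokens with the largest shift.
    keep_idx = np.argsort(shift)[-keep:]
    # Start from the base model and restore only the influential tokens.
    mixed = p_base.copy()
    mixed[keep_idx] = p_ft[keep_idx]
    # Renormalize so the result is a valid probability distribution.
    return mixed / mixed.sum()

# Made-up distributions over a four-token vocabulary.
p_base = np.array([0.40, 0.30, 0.20, 0.10])
p_ft = np.array([0.10, 0.32, 0.18, 0.40])
p_smooth = smooth_low_impact_tokens(p_base, p_ft, keep=2)
```

In this made-up example, tokens 0 and 3 carry the large shifts and keep their fine-tuned probabilities, while tokens 1 and 2, whose probabilities barely moved, fall back to the base model's values.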
Taxonomy
Topics: Privacy-Preserving Technologies in Data · Adversarial Robustness in Machine Learning · Topic Modeling
