ExpShield: Safeguarding Web Text from Unauthorized Crawling and LLM Exploitation

Ruixuan Liu; Toan Tran; Tianhao Wang; Hongsheng Hu; Shuo Wang; Li Xiong

arXiv:2412.21123·cs.CR·December 17, 2025

Open Access

TL;DR

ExpShield introduces a proactive, invisible perturbation method to protect web text from unauthorized use in training large language models, effectively reducing memorization and privacy risks without compromising readability.

Contribution

The paper presents a novel defense mechanism, ExpShield, that employs targeted perturbations and a new metric to mitigate text memorization in models, addressing limitations of existing protections.

Findings

01  The defense lowers membership inference attack AUC from 0.95 to 0.55.

02  Instance exploitation drops to near zero.

03  The defense is effective across language and vision-to-language models.
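
To put the AUC figures in context: an AUC of 0.5 corresponds to an attacker who cannot distinguish training members from non-members better than chance, so a drop from 0.95 to 0.55 moves the attack close to random guessing. The sketch below computes AUC directly from attack scores as the probability that a random member outscores a random non-member. The function name `auc` and the score lists are hypothetical illustrations, not data or code from the paper.

```python
def auc(member_scores, nonmember_scores):
    """AUC as the probability that a random member scores higher than a
    random non-member, with ties counted as half a win."""
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))


# Hypothetical attack scores (for example, negative per-sample loss).
# These numbers are made up for illustration only.
undefended_members = [0.9, 0.8, 0.85, 0.7]
undefended_nonmembers = [0.3, 0.4, 0.75, 0.2]

defended_members = [0.5, 0.3, 0.6, 0.4]
defended_nonmembers = [0.45, 0.35, 0.55, 0.5]

print("undefended AUC:", auc(undefended_members, undefended_nonmembers))
print("defended AUC:  ", auc(defended_members, defended_nonmembers))
```

With well-separated scores the AUC is close to 1; when member and non-member scores overlap, it falls toward 0.5.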

Abstract

As large language models increasingly memorize web-scraped training content, they risk exposing copyrighted or private information. Existing protections require compliance from crawlers or model developers, which fundamentally limits their effectiveness. We propose ExpShield, a proactive self-guard that mitigates memorization while maintaining readability via invisible perturbations, and we formulate it as a constrained optimization problem. Because natural text lacks an individual-level risk metric, we first propose instance exploitation, a metric that measures how much training on a specific text increases the chance of guessing that text from a set of candidates, with zero indicating perfect defense. Directly solving the problem is infeasible for defenders without sufficient knowledge, so we develop two effective proxy solutions: single-level optimization and synthetic…
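
The abstract describes instance exploitation only in words, and the exact formula is not given in this excerpt. As a rough illustration of the idea, the sketch below assumes an adversary who picks a candidate with probability proportional to the model's likelihood (a softmax over negative losses) and reports how much that probability rises for the target text after training. The functions `guess_probability` and `exploitation_sketch`, the softmax guessing rule, and the loss values are all assumptions made for this example, not the paper's definition.

```python
import math


def guess_probability(losses, target_index):
    """Chance of selecting the target when candidates are chosen in
    proportion to exp(-loss). This guessing rule is an assumption."""
    weights = [math.exp(-loss) for loss in losses]
    return weights[target_index] / sum(weights)


def exploitation_sketch(losses_before, losses_after, target_index):
    """Increase in guess probability caused by training, floored at zero
    so that 0.0 means training gave the adversary no advantage."""
    before = guess_probability(losses_before, target_index)
    after = guess_probability(losses_after, target_index)
    return max(0.0, after - before)


# Hypothetical per-candidate losses; index 0 is the protected text.
losses_before = [2.0, 2.0, 2.0, 2.0]

# Unprotected: training sharply lowers the loss on the target text.
losses_after_unprotected = [0.5, 2.0, 2.0, 2.0]

# Protected: the target's loss stays in line with the other candidates.
losses_after_protected = [2.0, 2.0, 2.0, 2.0]

print(exploitation_sketch(losses_before, losses_after_unprotected, 0))
print(exploitation_sketch(losses_before, losses_after_protected, 0))
```

Under this toy reading, a value of zero matches the abstract's "perfect defense" case: training on the text does not make it any easier to single out among the candidates.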

Taxonomy

Topics: Security and Verification in Computing · Web Application Security Vulnerabilities · Access Control and Trust