Towards Operationalizing Right to Data Protection
Abhinav Java, Simra Shahid, Chirag Agarwal

TL;DR
This paper introduces RegText, a framework that makes natural language datasets unlearnable by injecting imperceptible correlations into them. The goal is to keep language models from learning from personal data, in response to legal and ethical concerns about indiscriminate data scraping.
Contribution
The paper presents a novel method for making text datasets unlearnable through imperceptible correlations, extending unlearnable data concepts from images to natural language.
Findings
RegText effectively prevents models such as GPT-4o and Llama from learning from protected datasets.
Language models fine-tuned on RegText-protected data reach lower test accuracy.
The approach offers a potential tool for data protection and privacy in NLP.
Abstract
The widespread practice of indiscriminate data scraping to fine-tune language models (LMs) raises significant legal and ethical concerns, particularly regarding compliance with data protection laws such as the General Data Protection Regulation (GDPR). This practice often results in the unauthorized use of personal information, prompting growing debate within the academic and regulatory communities. Recent works have introduced the concept of generating unlearnable datasets (by adding imperceptible noise to the clean data), such that the underlying model achieves lower loss during training but fails to generalize to the unseen test setting. Though somewhat effective, these approaches are predominantly designed for images and are limited by several practical constraints like requiring knowledge of the target model. To this end, we introduce RegText, a framework that injects imperceptible…
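The general idea in the abstract, a training set that is easy to fit but does not generalize, can be illustrated with a toy sketch. This is not the RegText algorithm: the trigger tokens `zq` and `xv`, the tiny sentiment dataset, and the one-token rule learner are all hypothetical stand-ins chosen for illustration, and the injected tokens here are plainly visible rather than imperceptible.

```python
from collections import Counter, defaultdict

# Hypothetical trigger tokens, one per class (illustrative, not from the paper).
TRIGGERS = {0: "zq", 1: "xv"}


def inject_triggers(texts, labels, triggers=TRIGGERS):
    """Append a class-specific token to each example (a toy spurious correlation)."""
    return [f"{text} {triggers[y]}" for text, y in zip(texts, labels)]


def fit_one_token_rules(texts, labels):
    """For each label, pick the token that appears in the most examples of that
    label and in no example of any other label."""
    per_label = defaultdict(Counter)
    for text, y in zip(texts, labels):
        per_label[y].update(set(text.split()))
    rules = {}
    for y, counts in per_label.items():
        others = set().union(*(set(c) for k, c in per_label.items() if k != y))
        pure = {tok: n for tok, n in counts.items() if tok not in others}
        if pure:
            # sorted() makes tie-breaking deterministic
            rules[y] = max(sorted(pure), key=pure.get)
    return rules


def predict(rules, text, default=0):
    """Return the first label whose rule token occurs in the text, else the default."""
    tokens = set(text.split())
    for y, tok in sorted(rules.items()):
        if tok in tokens:
            return y
    return default


def accuracy(rules, texts, labels):
    return sum(predict(rules, t) == y for t, y in zip(texts, labels)) / len(labels)


train_texts = ["great film", "loved the acting", "great story",
               "boring film", "hated the acting", "boring story"]
train_labels = [1, 1, 1, 0, 0, 0]
test_texts = ["great acting", "loved the story", "boring acting", "hated the film"]
test_labels = [1, 1, 0, 0]

protected_texts = inject_triggers(train_texts, train_labels)

clean_rules = fit_one_token_rules(train_texts, train_labels)
protected_rules = fit_one_token_rules(protected_texts, train_labels)

print("clean rules:", clean_rules)
print("protected rules:", protected_rules)
print("protected train accuracy:", accuracy(protected_rules, protected_texts, train_labels))
print("clean-trained test accuracy:", accuracy(clean_rules, test_texts, test_labels))
print("protected-trained test accuracy:", accuracy(protected_rules, test_texts, test_labels))
```

In this toy setup, the learner fit on the protected copy keys on the injected tokens, which separate the training labels perfectly. Those tokens never occur in the clean test sentences, so it falls back to the default label there. A real method would additionally have to keep the modification imperceptible and preserve the semantics of the text, which this sketch does not attempt.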
Taxonomy
Topics: Privacy, Security, and Data Protection
Methods: LLaMA
