Towards Operationalizing Right to Data Protection

Abhinav Java; Simra Shahid; Chirag Agarwal

arXiv:2411.08506 · cs.LG · November 19, 2024

TL;DR

This paper introduces RegText, a framework that makes natural language datasets unlearnable by adding imperceptible correlations to the text. The goal is to keep language models from learning personal data, addressing the legal and ethical concerns raised by unauthorized use.

Contribution

The paper presents a novel method for making text datasets unlearnable through imperceptible correlations, extending unlearnable data concepts from images to natural language.
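To make the idea of an unlearnable text dataset concrete, here is a toy sketch. It is not RegText's algorithm, which this page does not describe. The marker tokens, function names, and example sentences are all hypothetical, and the markers are deliberately visible rather than imperceptible. The sketch only shows the general mechanism: a label-correlated token gives a model a shortcut that fits the training data perfectly but carries no signal on clean test text.

```python
import random

# Hypothetical per-class marker tokens (an assumption for illustration only).
MARKERS = {0: "qzx", 1: "vrk"}

def inject_marker(text, label, rng):
    """Insert the label's marker token at a random word position."""
    words = text.split()
    pos = rng.randrange(len(words) + 1)
    return " ".join(words[:pos] + [MARKERS[label]] + words[pos:])

def protect(dataset, seed=0):
    """Return a copy of the dataset with a spurious label-token correlation."""
    rng = random.Random(seed)
    return [(inject_marker(text, label, rng), label) for text, label in dataset]

def shortcut_predict(text):
    """Stand-in for a model that learned only the shortcut.

    Predicts the class whose marker appears in the text, or None if no
    marker is present (the model has learned nothing about the content).
    """
    words = set(text.split())
    for label, marker in MARKERS.items():
        if marker in words:
            return label
    return None

def accuracy(dataset):
    """Fraction of examples the shortcut model labels correctly."""
    return sum(shortcut_predict(text) == label for text, label in dataset) / len(dataset)

train = [
    ("the film was wonderful", 1),
    ("a dull and lifeless plot", 0),
    ("great acting and a sharp script", 1),
    ("i regret watching this", 0),
]
test = [
    ("an enjoyable surprise", 1),
    ("tedious from start to finish", 0),
]

protected = protect(train)
train_acc = accuracy(protected)  # shortcut fits the protected training set
test_acc = accuracy(test)        # shortcut is absent from clean test text
print(train_acc, test_acc)
```

A real language model would not be this brittle, and a practical method has to choose tokens that stay inconspicuous and work without knowledge of the target model; the sketch leaves both problems aside.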

Findings

01

RegText effectively prevents models like GPT-4o and Llama from learning on protected datasets.

02

Large language models fine-tuned on RegText-protected data achieve lower test accuracy.

03

The approach offers a potential tool for data protection and privacy in NLP.

Abstract

The widespread practice of indiscriminate data scraping to fine-tune language models (LMs) raises significant legal and ethical concerns, particularly regarding compliance with data protection laws such as the General Data Protection Regulation (GDPR). This practice often results in the unauthorized use of personal information, prompting growing debate within the academic and regulatory communities. Recent works have introduced the concept of generating unlearnable datasets (by adding imperceptible noise to the clean data), such that the underlying model achieves lower loss during training but fails to generalize to the unseen test setting. Though somewhat effective, these approaches are predominantly designed for images and are limited by several practical constraints like requiring knowledge of the target model. To this end, we introduce RegText, a framework that injects imperceptible…

Taxonomy

Topics: Privacy, Security, and Data Protection

Methods: LLaMA