The Empirical Impact of Data Sanitization on Language Models

Anwesan Pal, Radhika Bhargava, Kyle Hinsz, Jacques Esterhuizen, and Sudipta Bhattacharya

arXiv:2411.05978 · cs.CL · November 12, 2024

TL;DR

This paper empirically examines how data sanitization, especially redacting sensitive information, affects language model performance across various NLP tasks, revealing task-dependent impacts and proposing mitigation strategies.

Contribution

It provides a comprehensive analysis of data sanitization effects on language models, highlighting task-specific impacts and introducing methods to mitigate performance degradation.

Findings

1. Low impact (1-5%) on sentiment analysis and entailment tasks.

2. Significant performance drop (>25%) on comprehension Q&A tasks.

3. Proposed content-based subsampling to repair redacted datasets.
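The percentages above compare a model's score on original data with its score on sanitized data. The listing does not say whether the drops are relative or absolute, so the following is only a sketch that assumes a relative drop; the function name `relative_drop` and the scores are hypothetical, not values from the paper.

```python
def relative_drop(original_score: float, sanitized_score: float) -> float:
    """Percentage of the original score that is lost after sanitization.

    Assumption: the drop is measured relative to the original score.
    """
    if original_score <= 0:
        raise ValueError("original_score must be positive")
    return 100.0 * (original_score - sanitized_score) / original_score


# Made-up scores for illustration only; these are not results from the paper.
qa_drop = relative_drop(original_score=0.80, sanitized_score=0.60)
print(round(qa_drop, 1))
```

Under this assumption, a task whose score falls from 0.80 to 0.60 would land in the ">25%" range only at the boundary, which shows why the choice between relative and absolute drops matters when reading such figures.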

Abstract

Data sanitization in the context of language modeling involves identifying sensitive content, such as personally identifiable information (PII), and redacting it from a dataset corpus. It is a common practice used in natural language processing (NLP) to maintain privacy. Nevertheless, the impact of data sanitization on the language understanding capability of a language model remains less studied. This paper empirically analyzes the effects of data sanitization across several benchmark language-modeling tasks including comprehension question answering (Q&A), entailment, sentiment analysis, and text classification. Our experiments cover a wide spectrum, ranging from finetuning small-scale language models to prompting large language models (LLMs), on both original and sanitized datasets, and compare their performance across the tasks. Interestingly, our results suggest that for some…
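As a rough illustration of the identify-and-redact step the abstract describes, here is a minimal sketch. It assumes two simple regular expressions stand in for a PII detector; the pattern set, the placeholder format, and the `redact` function are hypothetical and are not the authors' pipeline.

```python
import re

# Hypothetical patterns for illustration only. Real sanitization pipelines
# typically use trained PII detectors rather than a handful of regexes.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


def redact(text: str) -> str:
    """Replace every match of each pattern with a bracketed type label."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


original = "Contact Jane at jane.doe@example.com or 555-123-4567."
sanitized = redact(original)
print(sanitized)  # Contact Jane at [EMAIL] or [PHONE].
```

The name "Jane" survives because these patterns do not cover names. The sanitized sentence also shows how redaction can remove exactly the tokens a comprehension question might ask about, such as a contact address.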

Taxonomy

Topics: Privacy-Preserving Technologies in Data