Recovering from Privacy-Preserving Masking with Large Language Models
Arpita Vats, Zhe Liu, Peng Su, Debjyoti Paul, Yingyi Ma, Yutong Pang, Zeeshan Ahmed, Ozlem Kalinli

TL;DR
This paper explores using large language models to replace sensitive tokens in textual data, enabling privacy-preserving model adaptation with performance comparable to models trained on original data.
Contribution
It introduces LLM-based methods for suggesting substitutes for masked tokens and empirically evaluates their effectiveness in privacy-preserving NLP model adaptation.
Findings
Models trained on obfuscated data perform comparably to those trained on original data.
LLM-based token substitution effectively balances privacy and model performance.
Empirical studies validate the proposed masking approaches across various datasets.
Abstract
Model adaptation is crucial for handling the discrepancy between proxy training data and the actual user data received. To perform adaptation effectively, users' textual data is typically stored on servers or on their local devices, where downstream natural language processing (NLP) models can be trained directly on such in-domain data. However, this might raise privacy and security concerns because of the added risk of exposing user information to adversaries. Replacing identifying information in textual data with a generic marker has recently been explored. In this work, we leverage large language models (LLMs) to suggest substitutes for masked tokens and evaluate their effectiveness on downstream language modeling tasks. Specifically, we propose multiple pre-trained and fine-tuned LLM-based approaches and perform empirical studies on various datasets for the comparison of these…
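The mask-then-substitute idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `[MASK]` marker string, the function names, and `toy_suggester` are all hypothetical, and the toy rule-based suggester only stands in for the place where an LLM would propose a substitute from the surrounding context.

```python
from typing import Callable, Iterable, List

MASK = "[MASK]"  # assumed generic marker; the paper's actual marker is not specified here

# A suggester maps (left context, right context) to one replacement token.
Suggester = Callable[[List[str], List[str]], str]


def mask_tokens(tokens: List[str], sensitive: Iterable[str]) -> List[str]:
    """Replace every token found in `sensitive` (case-insensitive) with MASK."""
    lowered = {s.lower() for s in sensitive}
    return [MASK if t.lower() in lowered else t for t in tokens]


def recover(tokens: List[str], suggest: Suggester) -> List[str]:
    """Fill each MASK, left to right, with a token proposed by `suggest`.

    The left context passed to `suggest` contains substitutes chosen so far,
    so the original sensitive tokens are never seen at this stage.
    """
    out: List[str] = []
    for i, tok in enumerate(tokens):
        if tok == MASK:
            out.append(suggest(out, tokens[i + 1:]))
        else:
            out.append(tok)
    return out


def toy_suggester(left: List[str], right: List[str]) -> str:
    """Hypothetical stand-in for an LLM: choose a filler from the previous word."""
    if left and left[-1].lower() in {"dear", "hi"}:
        return "Alex"
    return "someone"


original = "Dear Maria please call John".split()
masked = mask_tokens(original, {"Maria", "John"})
recovered = recover(masked, toy_suggester)
```

In a setup like the one the abstract describes, the recovered text, rather than the original, would then be used to adapt the downstream language model; swapping `toy_suggester` for a call to a pre-trained or fine-tuned LLM is the part this sketch leaves open.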
Taxonomy
Topics: Topic Modeling · Privacy-Preserving Technologies in Data
