Anti-adversarial Learning: Desensitizing Prompts for Large Language Models

Xuan Li; Zhe Yin; Xiaodong Gu; Beijun Shen

arXiv:2505.01273 · cs.CL · November 19, 2025

TL;DR

PromptObfus introduces an anti-adversarial learning approach that desensitizes user prompts in large language models by masking sensitive words and replacing them with contextually appropriate alternatives, safeguarding privacy without sacrificing task accuracy.

Contribution

This paper presents PromptObfus, a novel prompt desensitization method using anti-adversarial learning to protect privacy in LLM prompts while maintaining performance.

Findings

01. Effectively prevents privacy inference from remote LLMs.

02. Maintains high task performance despite prompt desensitization.

03. Demonstrated on three NLP tasks with positive results.

Abstract

With the widespread use of LLMs, preserving privacy in user prompts has become crucial, as prompts risk exposing private and sensitive data to cloud LLMs. Traditional techniques such as homomorphic encryption, secure multi-party computation, and federated learning incur heavy computational costs and require user participation, which limits their applicability in LLM scenarios. In this paper, we propose PromptObfus, a novel method for desensitizing LLM prompts. The core idea of PromptObfus is "anti-adversarial" learning, which perturbs privacy words in the prompt to obscure sensitive information while keeping model predictions stable. Specifically, PromptObfus frames prompt desensitization as a masked language modeling task, replacing privacy-sensitive terms with a [MASK] token. A desensitization model is trained to generate candidate replacements for each…
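
The mask-then-replace idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract is cut off before it explains how candidates are generated and chosen, so the fixed candidate list, the keyword-counting `toy_task_score`, and the "keep the task score unchanged" selection rule are all assumptions made for illustration. In practice a trained desensitization model and a real task model would take their place.

```python
import re

MASK = "[MASK]"

def mask_sensitive(prompt, sensitive_terms):
    """Replace each whole-word occurrence of a sensitive term with [MASK]."""
    masked = prompt
    for term in sensitive_terms:
        pattern = r"\b" + re.escape(term) + r"\b"
        masked = re.sub(pattern, MASK, masked, flags=re.IGNORECASE)
    return masked

def toy_task_score(text):
    """Hypothetical stand-in for a task model: a crude sentiment count."""
    positive = {"great", "helpful"}
    negative = {"rude", "slow"}
    words = re.findall(r"[a-z]+", text.lower())
    return sum(w in positive for w in words) - sum(w in negative for w in words)

def make_stability_score(original_prompt):
    """Score a candidate text by how little it shifts the toy task score."""
    reference = toy_task_score(original_prompt)
    return lambda text: -abs(toy_task_score(text) - reference)

def fill_masks(masked_prompt, candidates, score):
    """Fill masks left to right, keeping the highest-scoring candidate each time.

    Assumes no candidate contains the mask token itself.
    """
    result = masked_prompt
    while MASK in result:
        best = max(candidates, key=lambda w: score(result.replace(MASK, w, 1)))
        result = result.replace(MASK, best, 1)
    return result

# Illustrative inputs; the sensitive terms and candidates are made up.
prompt = "Alice from Boston said the staff were helpful"
sensitive = ["Alice", "Boston"]
candidates = ["someone", "rude", "town"]  # assumed output of a candidate generator

masked = mask_sensitive(prompt, sensitive)
desensitized = fill_masks(masked, candidates, make_stability_score(prompt))
```

The selection rule rejects a candidate such as "rude" because it would change the toy task score, which mirrors, in a very simplified way, the stated goal of obscuring sensitive words while keeping predictions stable.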

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Hate Speech and Cyberbullying Detection