Can Language Models be Instructed to Protect Personal Information?

Yang Chen; Ethan Mendes; Sauvik Das; Wei Xu; Alan Ritter

arXiv:2310.02224 · cs.CL · October 4, 2023 · 5 citations

Open Access · 3 Reviews

TL;DR

This paper introduces PrivQA, a benchmark for evaluating privacy-utility trade-offs in multimodal language models, and proposes a self-moderation technique to enhance privacy, while also revealing vulnerabilities to simple adversarial attacks.

Contribution

It presents PrivQA as a new benchmark for assessing privacy protections and introduces a self-moderation method to improve privacy in multimodal models.

Findings

01. Self-moderation significantly improves privacy protection.

02. Adversaries can bypass protections with simple jailbreaking methods.

03. PrivQA can aid in developing more robust privacy-preserving models.

Abstract

Large multimodal language models have proven transformative in numerous applications. However, these models have been shown to memorize and leak pre-training data, raising serious user privacy and information security concerns. While data leaks should be prevented, it is also crucial to examine the trade-off between the privacy protection and model utility of proposed approaches. In this paper, we introduce PrivQA -- a multimodal benchmark to assess this privacy/utility trade-off when a model is instructed to protect specific categories of personal information in a simulated scenario. We also propose a technique to iteratively self-moderate responses, which significantly improves privacy. However, through a series of red-teaming experiments, we find that adversaries can also easily circumvent these protections with simple jailbreaking methods through textual and/or image inputs. We…
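The iterative self-moderation idea mentioned in the abstract can be pictured with a minimal sketch. This is not the authors' implementation: the function names, the refusal text, and the rule "refuse as soon as any re-check flags the draft" are all assumptions made for illustration, and the two toy functions stand in for real model calls.

```python
from typing import Callable

# Hypothetical refusal text; the paper's actual wording is not given here.
REFUSAL = "I can't share information about that."


def self_moderate(
    question: str,
    answer_fn: Callable[[str], str],
    flags_protected_fn: Callable[[str, str, int], bool],
    max_rounds: int = 3,
) -> str:
    """Draft an answer, then re-check it up to max_rounds times.

    answer_fn and flags_protected_fn are placeholders for model calls
    (assumed interfaces, not a real library API).
    """
    draft = answer_fn(question)
    for round_idx in range(max_rounds):
        # Each round asks the checker again whether the draft reveals
        # protected information; one positive verdict triggers a refusal.
        if flags_protected_fn(question, draft, round_idx):
            return REFUSAL
    return draft


def toy_answer(question: str) -> str:
    """Toy stand-in for the answering model."""
    if "capital" in question.lower():
        return "Paris"
    return "Alice lives at 12 Elm Street."


def toy_check(question: str, draft: str, round_idx: int) -> bool:
    """Toy stand-in for the moderator: it misses the leak on the first
    pass and only notices it when asked a second time."""
    return "lives at" in draft and round_idx >= 1
```

In this toy setup a single moderation pass (`max_rounds=1`) lets the address through, while the default of three passes replaces it with the refusal. That is the intuition behind re-checking more than once, though the real behaviour depends entirely on the underlying model.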

Peer Reviews

Decision · Submitted to ICLR 2024

Reviewer 01 · Rating 3 (reject, not good enough) · Confidence 4

Strengths

Pros:
1. This research presents an open benchmark designed to evaluate language and vision models on their ability to safeguard personal information by adhering to instructions.
2. The study introduces a self-moderation approach that enhances the proficiency of models in complying with access control directives, while also revealing persistent biases in the protection afforded to diverse groups.
3. The paper details a sequence of red-teaming exercises, highlighting that current advanced models c

Weaknesses

Cons:
1. The technical novelty is limited. This paper just tests whether conventional instruction-tuned LLMs can protect privacy. The proposed “Self-Moderation” seems to be a slight modification of the “reflection” techniques used in many previous works (there is a survey [1] on “reflection” techniques).
2. The title is misleading. It is not closely related to the core message of this paper, because the paper does not conduct instruction tuning to protect privacy but just tests whether or

Reviewer 02 · Rating 5 (marginally below the acceptance threshold) · Confidence 4

Strengths

1. The paper looks at an important problem, as LLMs are being used more and more, and prompting and instruction tuning are ubiquitous.
2. I particularly like the visual aspect of the work and its look into multimodal models, as there aren't many existing works that focus on these models.

Weaknesses

1. The threat model of the paper is not at all clear, nor is the paper positioned well among prior work. What is the privacy definition? What are we trying to protect: training data or inference data? What is the actual application that the authors are targeting? What is the real-world use case? It seems like the authors are targeting training data; however, according to existing extraction attacks [1], this is not a realistic scenario and not a real problem. There is no successful ext

Reviewer 03 · Rating 6 (marginally above the acceptance threshold) · Confidence 4

Strengths

- The paper provides a valuable benchmark for privacy protection in language models, which is an emerging and important research area. There are few existing datasets that focus on privacy issues in language models.
- The paper uses state-of-the-art models for the evaluation, which makes the results more relevant and convincing.

Weaknesses

- The paper does not share the code or data to reproduce the results, which limits the reproducibility and verifiability of the work. The paper says the URL is removed for review, but there are ways to share it anonymously (e.g. anonymous.4open.science).
- The paper uses evaluation metrics that do not capture the severity of privacy breaches. Privacy is about preventing the worst-case scenarios, not the average ones. Therefore, privacy metrics should reflect that even a single leak of private da
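The reviewer's contrast between average-case and worst-case privacy metrics can be illustrated with a small sketch. Both metric definitions and the per-person data layout below are assumptions chosen for illustration; they are not the metrics used in the paper.

```python
from typing import Dict, List


def average_leak_rate(leaks_by_person: Dict[str, List[bool]]) -> float:
    """Fraction of all queries that leaked (an average-case view)."""
    outcomes = [leak for leaks in leaks_by_person.values() for leak in leaks]
    return sum(outcomes) / len(outcomes)


def fraction_with_any_leak(leaks_by_person: Dict[str, List[bool]]) -> float:
    """Fraction of protected people with at least one leak.

    A single leaked answer counts the person as exposed, which is
    closer to the worst-case view the reviewer argues for.
    """
    exposed = sum(any(leaks) for leaks in leaks_by_person.values())
    return exposed / len(leaks_by_person)
```

With two hypothetical people queried ten times each and a single leak for one of them, the average leak rate is 1 in 20 queries, yet half of the protected people have been exposed. An average-style score alone can therefore look reassuring while hiding exactly the failures the reviewer is concerned about.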

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling