LLM Censorship: A Machine Learning Challenge or a Computer Security Problem?

David Glukhov; Ilia Shumailov; Yarin Gal; Nicolas Papernot; Vardan Papyan

arXiv:2307.10719 · cs.AI · July 25, 2023 · 20 cites

Open Access

TL;DR

This paper examines the limitations of current semantic censorship methods for LLMs, argues that semantic censorship can be framed as an undecidable problem, and proposes a shift to security-based approaches to better mitigate the risks.

Contribution

It demonstrates the theoretical limitations of semantic censorship in LLMs and advocates for treating censorship as a security problem rather than a purely machine learning challenge.

Findings

01. Semantic censorship is undecidable.

02. Attackers can reconstruct forbidden outputs from permissible ones.

03. Security-based approaches are necessary for effective mitigation.
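
Finding 02 can be pictured with a toy sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the censor is a plain substring check (a deployed censor would more likely be another model), `FORBIDDEN` is a harmless placeholder string, and `censor_allows` and `split_into_pieces` are hypothetical helper names. The point is only that a censor judging each output in isolation can permit every piece while rejecting the whole.

```python
# Toy illustration only: a per-output censor and a piecewise request pattern.
FORBIDDEN = "open sesame"  # hypothetical stand-in for a forbidden output


def censor_allows(text: str) -> bool:
    """Hypothetical censor: permit one output unless it contains the forbidden string."""
    return FORBIDDEN not in text.lower()


def split_into_pieces(text: str, size: int) -> list[str]:
    """Cut text into consecutive chunks of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


# Each piece is checked on its own, as an output-level censor would do.
pieces = split_into_pieces(FORBIDDEN, 4)
all_pieces_allowed = all(censor_allows(piece) for piece in pieces)

# The requester reassembles the pieces outside the censor's view.
reassembled = "".join(pieces)
reassembled_allowed = censor_allows(reassembled)

print(pieces, all_pieces_allowed, reassembled_allowed)
```

In this sketch every piece passes the check, yet the reassembled string is exactly the one the censor was meant to block, which is why a defence that only inspects individual outputs cannot rule out this kind of composition.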

Abstract

Large language models (LLMs) have exhibited impressive capabilities in comprehending complex instructions. However, their blind adherence to provided instructions has led to concerns regarding risks of malicious use. Existing defence mechanisms, such as model fine-tuning or output censorship using LLMs, have proven to be fallible, as LLMs can still generate problematic responses. Commonly employed censorship approaches treat the issue as a machine learning problem and rely on another LM to detect undesirable content in LLM outputs. In this paper, we present the theoretical limitations of such semantic censorship approaches. Specifically, we demonstrate that semantic censorship can be perceived as an undecidable problem, highlighting the inherent challenges in censorship that arise due to LLMs' programmatic and instruction-following capabilities. Furthermore, we argue that the challenges…
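
One way to build intuition for why instruction-following makes output censorship hard is to look at what happens when an output is transformed before the censor sees it. The sketch below is an illustration under stated assumptions, not the paper's proof: the censor is a hypothetical substring check, `FORBIDDEN` is a harmless placeholder, and ROT13 stands in for any reversible transformation a model could be instructed to apply.

```python
import codecs

FORBIDDEN = "open sesame"  # hypothetical stand-in for a forbidden output


def censor_allows(text: str) -> bool:
    """Hypothetical censor: permit an output unless it contains the forbidden string."""
    return FORBIDDEN not in text.lower()


# Assume the model was instructed to emit its answer in ROT13 instead of plain text.
encoded_output = codecs.encode(FORBIDDEN, "rot_13")

# The censor inspects only the encoded text and sees nothing it recognises.
encoded_allowed = censor_allows(encoded_output)

# The recipient reverses the transformation locally, outside the censor's view.
decoded_output = codecs.decode(encoded_output, "rot_13")
decoded_allowed = censor_allows(decoded_output)

print(encoded_output, encoded_allowed, decoded_allowed)
```

A censor could add a ROT13 decoder, but the requester can switch to a different transformation, so inspecting the surface form of an output does not settle what it means to the person who receives it.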

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling