LLM Censorship: A Machine Learning Challenge or a Computer Security Problem?
David Glukhov, Ilia Shumailov, Yarin Gal, Nicolas Papernot, Vardan Papyan

TL;DR
This paper examines the limits of current semantic censorship methods for LLMs, shows that semantic censorship can be framed as an undecidable problem, and proposes shifting to security-based approaches to mitigate the risk of malicious use.
Contribution
It demonstrates the theoretical limitations of semantic censorship in LLMs and advocates for treating censorship as a security problem rather than a purely machine learning challenge.
Findings
Semantic censorship is undecidable.
Attackers can reconstruct forbidden outputs from permissible ones.
Security-based approaches are necessary for effective mitigation.
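The second finding can be illustrated with a minimal sketch. It is not the paper's construction: the substring check below is a hypothetical stand-in for a real semantic censor, and the names FORBIDDEN, censor_allows, and split_into_pieces are assumptions made for illustration. The point is only that a censor judging each output in isolation can approve every piece while the recombined result is something it would have blocked.

```python
# Toy illustration (assumed setup, not taken from the paper):
# a per-output censor approves each fragment, yet the fragments
# concatenate into a string the censor would reject.

FORBIDDEN = "FORBIDDEN STRING"  # hypothetical content the censor must block


def censor_allows(output: str) -> bool:
    """Toy censor: permit an output only if it lacks the forbidden string."""
    return FORBIDDEN not in output


def split_into_pieces(text: str, size: int) -> list[str]:
    """Split text into consecutive chunks of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


# The attacker asks for the content in small fragments, one per query.
pieces = split_into_pieces(FORBIDDEN, 4)

# Each fragment is judged on its own and passes the censor.
all_permitted = all(censor_allows(piece) for piece in pieces)

# The attacker reassembles the fragments outside the censor's view.
reconstructed = "".join(pieces)

print(all_permitted)                 # every fragment was permitted
print(reconstructed == FORBIDDEN)    # the forbidden output is recovered
print(censor_allows(reconstructed))  # the censor would have blocked it whole
```

A real censor would be far more capable than a substring match, but the same gap applies whenever permissibility is decided one output at a time and the user is free to combine outputs afterwards.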
Abstract
Large language models (LLMs) have exhibited impressive capabilities in comprehending complex instructions. However, their blind adherence to provided instructions has led to concerns regarding risks of malicious use. Existing defence mechanisms, such as model fine-tuning or output censorship using LLMs, have proven to be fallible, as LLMs can still generate problematic responses. Commonly employed censorship approaches treat the issue as a machine learning problem and rely on another LM to detect undesirable content in LLM outputs. In this paper, we present the theoretical limitations of such semantic censorship approaches. Specifically, we demonstrate that semantic censorship can be perceived as an undecidable problem, highlighting the inherent challenges in censorship that arise due to LLMs' programmatic and instruction-following capabilities. Furthermore, we argue that the challenges…
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Topic Modeling
