Breach By A Thousand Leaks: Unsafe Information Leakage in 'Safe' AI Responses

David Glukhov; Ziwen Han; Ilia Shumailov; Vardan Papyan; Nicolas Papernot

arXiv:2407.02551·cs.CR·October 31, 2024

Open Access

TL;DR

This paper argues that current safety measures for language models are insufficient against inferential attacks that extract dangerous knowledge. It proposes a new evaluation framework and shows that safety and utility trade off inherently.

Contribution

It introduces an information-theoretic threat model and a novel question-decomposition attack to better evaluate and understand risks of information leakage in safe AI responses.

Findings

01. Traditional defenses fail against inferential attacks that extract impermissible knowledge.

02. A new evaluation framework quantifies the risk of information leakage in language models.

03. A safety-utility trade-off is inevitable when implementing information censorship.

Abstract

The vulnerability of frontier language models to misuse and jailbreaks has prompted the development of safety measures such as filters and alignment training, which aim to ensure safety through robustness to adversarially crafted prompts. We assert that robustness is fundamentally insufficient for ensuring safety goals, and that current defenses and evaluation methods fail to account for the risks of dual-intent queries and their composition for malicious goals. To quantify these risks, we introduce a new safety evaluation framework based on impermissible information leakage of model outputs and demonstrate how our proposed question-decomposition attack can extract dangerous knowledge from a censored LLM more effectively than traditional jailbreaking. Underlying our proposed evaluation method is a novel information-theoretic threat model of inferential adversaries, distinguished from security…
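To make the idea of leakage through composition concrete, here is a minimal toy sketch. It is not the paper's metric or threat model: it assumes a hypothetical adversary who holds a belief over four abstract hypotheses and updates it with Bayes' rule after each response, and it measures leakage as the drop in Shannon entropy of that belief. The function names, the hypotheses, and the likelihood values are all illustrative assumptions.

```python
import math


def entropy(dist):
    """Shannon entropy, in bits, of a dict mapping outcome -> probability."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def bayes_update(prior, likelihood):
    """Posterior over hypotheses, where likelihood[h] = P(response | h)."""
    unnorm = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(unnorm.values())
    if total == 0:
        raise ValueError("response has zero probability under the prior")
    return {h: v / total for h, v in unnorm.items()}


def leakage_bits(prior, likelihoods):
    """Entropy reduction (toy leakage measure) after observing each response in turn."""
    belief = dict(prior)
    for likelihood in likelihoods:
        belief = bayes_update(belief, likelihood)
    return entropy(prior) - entropy(belief), belief


# Hypothetical setup: the adversary is unsure which of four abstract
# hypotheses is true (2 bits of uncertainty).
prior = {h: 0.25 for h in "ABCD"}

# Two responses that each look harmless alone: each only rules out
# half of the hypotheses.
response_1 = {"A": 1.0, "B": 1.0, "C": 0.0, "D": 0.0}
response_2 = {"A": 1.0, "B": 0.0, "C": 1.0, "D": 0.0}

alone_1, _ = leakage_bits(prior, [response_1])
alone_2, _ = leakage_bits(prior, [response_2])
joint, belief = leakage_bits(prior, [response_1, response_2])

print(f"response 1 alone: {alone_1:.1f} bits")  # 1.0 bits
print(f"response 2 alone: {alone_2:.1f} bits")  # 1.0 bits
print(f"both combined:    {joint:.1f} bits")    # 2.0 bits
```

In this toy case each response removes only half of the adversary's uncertainty, but together they identify hypothesis A exactly. A filter that judges each response in isolation would see only partial leakage at every step, which is the kind of compositional risk the abstract describes.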

Taxonomy

Topics: Adversarial Robustness in Machine Learning