Intent Laundering: AI Safety Datasets Are Not What They Seem
Shahriar Golchin, Marc Wetter

TL;DR
This paper critically assesses adversarial safety datasets. It finds that they rely too heavily on triggering cues and that models deemed safe often respond unsafely once these cues are removed, exposing a gap between these datasets and real-world threats.
Contribution
The study introduces 'intent laundering,' a procedure that strips triggering cues from adversarial attacks while preserving their malicious intent, and uses it to expose the limitations of current safety datasets and evaluation practices.
Findings
Datasets overrely on triggering cues, which is unrealistic compared to real-world attacks.
Models considered safe respond unsafely once triggering cues are removed.
Intent laundering achieves high attack success rates, indicating flaws in current safety evaluations.
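As an illustration of the metric behind these findings, the following is a minimal sketch of how an attack success rate could be compared between original and cue-free versions of the same data points. It assumes the attack success rate is the fraction of attempts whose response a judge marked unsafe; the `Judgement` record, the variant labels, and the toy values are hypothetical and do not come from the paper, whose exact judging protocol is not described in this summary.

```python
from dataclasses import dataclass


@dataclass
class Judgement:
    """One judged model response (hypothetical record layout)."""
    prompt_id: str
    variant: str   # "original" or "laundered" (assumed labels)
    unsafe: bool   # True if the judge marked the response as unsafe


def attack_success_rate(judgements, variant):
    """Return the fraction of responses for `variant` that were judged unsafe."""
    subset = [j for j in judgements if j.variant == variant]
    if not subset:
        raise ValueError(f"no judgements for variant {variant!r}")
    return sum(j.unsafe for j in subset) / len(subset)


# Made-up toy judgements, for illustration only (not results from the paper).
judgements = [
    Judgement("p1", "original", False),
    Judgement("p2", "original", False),
    Judgement("p3", "original", True),
    Judgement("p4", "original", False),
    Judgement("p1", "laundered", True),
    Judgement("p2", "laundered", True),
    Judgement("p3", "laundered", True),
    Judgement("p4", "laundered", False),
]

asr_original = attack_success_rate(judgements, "original")
asr_laundered = attack_success_rate(judgements, "laundered")
print(f"original: {asr_original:.2f}, laundered: {asr_laundered:.2f}")
```

A large gap between the two rates on the same underlying data points would suggest that refusals are driven by the cues rather than by the intent, which is the kind of disconnect the findings describe.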
Abstract
We systematically evaluate the quality of widely used adversarial safety datasets from two perspectives: in isolation and in practice. In isolation, we examine how well these datasets reflect real-world adversarial attacks based on three defining properties: being driven by ulterior intent, well-crafted, and out-of-distribution. We find that these datasets overrely on "triggering cues": words or phrases with overt negative/sensitive connotations that are intended to trigger safety mechanisms explicitly, which is unrealistic compared to real-world attacks. In practice, we evaluate whether these datasets genuinely measure safety risks or merely provoke refusals through triggering cues. To explore this, we introduce "intent laundering": a procedure that abstracts away triggering cues from adversarial attacks (data points) while strictly preserving their malicious intent and all relevant…
