Intent Laundering: AI Safety Datasets Are Not What They Seem

Shahriar Golchin; Marc Wetter

arXiv:2602.16729 · cs.CR · April 24, 2026

TL;DR

This paper critically assesses widely used adversarial safety datasets. It finds that they overrely on triggering cues and that models deemed safe often become unsafe once those cues are removed, which exposes a gap between these datasets and real-world threats.

Contribution

The study introduces "intent laundering," a procedure that abstracts away triggering cues from adversarial attacks while preserving their malicious intent, and uses it to expose the limitations of current safety datasets and evaluation practices.

Findings

01. Datasets overrely on triggering cues, which makes them unrealistic compared to real-world attacks.

02. Models considered safe become unsafe once triggering cues are removed.

03. Intent laundering achieves high attack success rates, indicating flaws in current safety evaluations.

Abstract

We systematically evaluate the quality of widely used adversarial safety datasets from two perspectives: in isolation and in practice. In isolation, we examine how well these datasets reflect real-world adversarial attacks based on three defining properties: being driven by ulterior intent, well-crafted, and out-of-distribution. We find that these datasets overrely on "triggering cues": words or phrases with overt negative/sensitive connotations that are intended to trigger safety mechanisms explicitly, which is unrealistic compared to real-world attacks. In practice, we evaluate whether these datasets genuinely measure safety risks or merely provoke refusals through triggering cues. To explore this, we introduce "intent laundering": a procedure that abstracts away triggering cues from adversarial attacks (data points) while strictly preserving their malicious intent and all relevant…
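
To make the notion of overreliance on triggering cues concrete, the sketch below estimates what fraction of prompts in a dataset contain at least one cue word. It is only an illustration under stated assumptions: the abstract does not say how the authors identify triggering cues, so the keyword lexicon `TRIGGERING_CUES`, the helper names `contains_cue` and `cue_rate`, and the toy prompts are all hypothetical and not taken from the paper.

```python
import re

# Hypothetical cue lexicon: illustrative only, not taken from the paper.
TRIGGERING_CUES = {"illegal", "harmful", "dangerous", "malicious", "unethical"}


def contains_cue(prompt, cues=TRIGGERING_CUES):
    """Return True if any lexicon word appears as a whole word in the prompt."""
    tokens = set(re.findall(r"[a-z]+", prompt.lower()))
    return bool(tokens & set(cues))


def cue_rate(prompts, cues=TRIGGERING_CUES):
    """Fraction of prompts with at least one cue (0.0 for an empty list)."""
    if not prompts:
        return 0.0
    return sum(contains_cue(p, cues) for p in prompts) / len(prompts)


# Toy, benign placeholder prompts; not drawn from any real safety dataset.
sample = [
    "Describe a harmful way to do X.",
    "Explain how X works in general terms.",
    "List illegal methods for Y.",
    "Summarize the history of Z.",
]
print(cue_rate(sample))  # 0.5
```

A plain keyword match like this would likely miss multi-word phrases and context-dependent connotations, so it should be read as a rough lower bound on cue prevalence rather than a reproduction of the paper's analysis.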
