Boundary-targeted Membership Inference Attacks on Safety Classifiers

Anthony Hughes; Alexander Goldberg; Prince Jha; Adam Perer; Nikolaos Aletras; Niloofar Mireshghallah

arXiv:2605.22373 · cs.LG · May 22, 2026

TL;DR

This paper introduces a boundary-targeted membership inference attack on safety classifiers in generative AI, revealing privacy vulnerabilities and evaluating mitigation strategies.

Contribution

The paper proposes a novel boundary-targeted selection strategy for membership inference attacks (MIAs) and shows that it exposes more privacy leakage than existing methods.

Findings

01. Adversaries can recover 19% of flagged conversations at a 5% false positive rate.

02. The new attack is 3.5 times more effective than state-of-the-art MIAs.

03. Content filtering is an ineffective defense; noise-based strategies can reduce susceptibility.
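
The first finding pairs a recovery rate with a false positive level, which reads like a true positive rate (TPR) measured at a fixed false positive rate (FPR), a common way to report membership inference success. The sketch below shows how such a number could be computed from attack scores. It assumes that a higher score means "more likely a training member"; the function name `tpr_at_fpr` and the toy scores are hypothetical and are not taken from the paper.

```python
def tpr_at_fpr(member_scores, nonmember_scores, target_fpr=0.05):
    """TPR at the strictest threshold whose FPR does not exceed target_fpr.

    Assumption (not from the paper): higher score = more likely a member.
    """
    # Rank non-member scores from highest to lowest.
    ranked = sorted(nonmember_scores, reverse=True)
    # Maximum number of non-members we may wrongly flag as members.
    k = int(target_fpr * len(ranked))
    # Predict "member" only when score > threshold, so at most k
    # non-members can exceed it.
    threshold = ranked[k] if k < len(ranked) else float("-inf")
    true_positives = sum(s > threshold for s in member_scores)
    return true_positives / len(member_scores)


# Toy example with made-up scores.
members = [0.5, 0.35, 0.15, 0.05]
nonmembers = [0.1, 0.2, 0.3, 0.4]
print(tpr_at_fpr(members, nonmembers, target_fpr=0.5))
```

Reporting TPR at a low FPR, rather than average accuracy, emphasizes how many training examples an adversary can identify confidently while rarely accusing non-members.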

Abstract

Safety classifiers are essential safeguards within generative AI systems, filtering harmful content or identifying at-risk users when interacting with large language models. Despite their necessity, these models are trained on sensitive datasets including discussions of self-harm and mental health, raising important, yet poorly understood, privacy concerns. Membership inference attacks (MIAs) allow adversaries to infer membership of examples used to train models. In this work, we hypothesize that the examples on which the classifier is least confident are informative for an adversary seeking to infer membership. This reflects a localized failure of generalization, where the model relies on memorization to resolve ambiguity in the training set. To investigate this, we introduce a new boundary-targeted selection strategy that identifies low-confidence examples that amplify the signal…
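
The abstract is truncated here, so the exact selection criterion is not visible. As a rough illustration of the general idea of targeting low-confidence examples, the sketch below ranks candidates by how close a binary classifier's predicted probability is to 0.5 and keeps the closest ones. The margin `|p - 0.5|`, the function name `select_boundary_examples`, and the toy probabilities are all assumptions made for illustration, not the authors' method.

```python
def select_boundary_examples(probs, k):
    """Return indices of the k examples closest to the decision boundary.

    Assumption (not from the paper): "low confidence" is measured as the
    distance of the predicted positive-class probability from 0.5.
    """
    # Pair each margin with its index so sorting keeps track of positions.
    margins = [(abs(p - 0.5), i) for i, p in enumerate(probs)]
    margins.sort()  # smallest margin (least confident) first
    return [i for _, i in margins[:k]]


# Toy probabilities from a hypothetical safety classifier.
probs = [0.98, 0.52, 0.10, 0.45, 0.70]
print(select_boundary_examples(probs, k=2))  # indices of the two least confident examples
```

Under the abstract's hypothesis, an adversary would concentrate the membership test on the selected indices rather than on examples the classifier already labels with high confidence.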
