Boundary-targeted Membership Inference Attacks on Safety Classifiers
Anthony Hughes, Alexander Goldberg, Prince Jha, Adam Perer, Nikolaos Aletras, Niloofar Mireshghallah

TL;DR
This paper introduces a boundary-targeted membership inference attack on safety classifiers in generative AI, revealing privacy vulnerabilities and evaluating mitigation strategies.
Contribution
It proposes a novel boundary-targeted selection strategy for MIAs and demonstrates its effectiveness over existing methods in privacy leakage scenarios.
Findings
Adversaries can recover 19% of flagged conversations at a 5% false positive rate.
The new attack is 3.5 times more effective than state-of-the-art MIAs.
Content filtering is ineffective; noise strategies can mitigate susceptibility.
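The first finding can be read as a true positive rate measured at a fixed false positive rate, a common way to report membership inference results. The sketch below shows how such a number could be computed from attack scores. It is an illustration only: the function name `tpr_at_fpr`, the score lists, and the convention that higher scores mean "more likely a training member" are assumptions, not details taken from the paper.

```python
def tpr_at_fpr(member_scores, nonmember_scores, target_fpr=0.05):
    """Fraction of true members flagged when at most target_fpr of
    non-members are wrongly flagged. Higher score = more member-like
    (an assumed convention for this sketch)."""
    if not member_scores or not nonmember_scores:
        raise ValueError("need at least one score in each group")
    if not 0.0 <= target_fpr <= 1.0:
        raise ValueError("target_fpr must be between 0 and 1")

    # Rank non-member scores from highest to lowest.
    ranked = sorted(nonmember_scores, reverse=True)
    # How many non-members we are allowed to flag by mistake.
    allowed = int(target_fpr * len(ranked))
    # Flag only scores strictly above the first non-member we may not flag,
    # so ties never push the false positive count over the budget.
    threshold = ranked[allowed] if allowed < len(ranked) else float("-inf")

    flagged = sum(1 for s in member_scores if s > threshold)
    return flagged / len(member_scores)


if __name__ == "__main__":
    # Hypothetical scores, for illustration only.
    nonmembers = list(range(100))   # scores 0..99
    members = [99, 95, 94, 10]
    print(tpr_at_fpr(members, nonmembers, target_fpr=0.05))
```

Using a strict comparison against the threshold keeps the false positive count within the budget even when scores tie, at the cost of a slightly conservative true positive rate.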
Abstract
Safety classifiers are essential safeguards within generative AI systems, filtering harmful content or identifying at-risk users as they interact with large language models. Despite their necessity, these models are trained on sensitive datasets, including discussions of self-harm and mental health, raising important yet poorly understood privacy concerns. Membership inference attacks (MIAs) allow adversaries to infer whether a given example was used to train a model. In this work, we hypothesize that the examples on which the classifier is least confident are informative for an adversary seeking to infer membership. This reflects a localized failure of generalization, where the model relies on memorization to resolve ambiguity in the training set. To investigate this, we introduce a new boundary-targeted selection strategy that identifies low-confidence examples that amplify the signal…
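The abstract is cut off before it describes the selection strategy in detail, so the following is only a rough sketch of the general idea of picking low-confidence examples. It assumes a binary classifier that outputs a probability for the "flagged" class and treats distance from 0.5 as a proxy for confidence. The function name `select_low_confidence` and the parameter `k` are hypothetical, and the paper's actual criterion may differ.

```python
def select_low_confidence(probs, k):
    """Return the indices of the k examples whose predicted probability
    lies closest to 0.5, i.e. nearest the decision boundary of an
    assumed binary classifier."""
    if k < 0:
        raise ValueError("k must be non-negative")
    # Sort example indices by distance from the 0.5 decision boundary.
    ranked = sorted(range(len(probs)), key=lambda i: abs(probs[i] - 0.5))
    return ranked[:k]


if __name__ == "__main__":
    # Hypothetical classifier outputs, for illustration only.
    probs = [0.99, 0.52, 0.10, 0.45, 0.70]
    print(select_low_confidence(probs, k=2))
```

An adversary following the hypothesis above would then run a membership inference attack on the selected examples rather than on the full candidate pool.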
