Benchmarking Misuse Mitigation Against Covert Adversaries
Davis Brown, Mahdi Sabbaghi, Luze Sun, Alexander Robey, George J. Pappas, Eric Wong, Hamed Hassani

TL;DR
This paper introduces BSD, a benchmarking pipeline for evaluating language model defenses against covert misuse attacks that combine benign queries to achieve harmful goals.
Contribution
It develops a data generation pipeline and datasets to evaluate stateful defenses against covert misuse strategies in language models.
Findings
Decomposition attacks are effective misuse enablers.
Stateful defenses show promise in mitigating covert attacks.
New datasets challenge frontier models with difficult misuse scenarios.
Abstract
Existing language model safety evaluations focus on overt attacks and low-stakes tasks. In reality, an attacker can easily subvert existing safeguards by requesting help on small, benign-seeming tasks across many independent queries. Because the individual queries do not appear harmful, the attack is hard to detect. However, when combined, these fragments uplift misuse by helping the attacker complete hard and dangerous tasks. Toward identifying defenses against such strategies, we develop Benchmarks for Stateful Defenses (BSD), a data generation pipeline that automates evaluations of covert attacks and corresponding defenses. Using this pipeline, we curate two new datasets that are consistently refused by frontier models and are too difficult for weaker open-weight models. This enables us to evaluate decomposition attacks, which are found to be effective misuse enablers, and to…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
