The bitter lesson of misuse detection
Hadrien Mariaccia, Charbel-Raphaël Segerie, Diego Dorn

TL;DR
This paper introduces BELLS, a comprehensive benchmark for evaluating the effectiveness of supervision systems in detecting misuse and jailbreaks in large language models, revealing significant limitations of current specialized systems.
Contribution
The paper presents BELLS, a new two-dimensional benchmark that assesses supervision systems' performance across diverse attack types and harm levels, highlighting the need for generalist LLM capabilities.
Findings
Supervision systems have limited detection rates, especially against novel jailbreaks.
Generalist LLMs outperform specialized supervision systems in identifying harmful queries.
Simply asking a generalist LLM a yes/no question about whether a query is harmful often outperforms supervision systems from the market.
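
To make the last finding concrete, here is a minimal sketch of what a yes/no supervisor built on a generalist LLM could look like. The prompt wording, the function names (`is_flagged`, `ask_llm`, `fake_llm`), and the answer-parsing rule are all assumptions made for illustration; they are not taken from the paper. `ask_llm` stands for any function that sends a prompt to a model and returns its text reply.

```python
from typing import Callable

# Hypothetical prompt; the paper's exact wording is not given in this summary.
PROMPT_TEMPLATE = (
    "Is the following user message harmful? Answer with 'yes' or 'no' only.\n\n"
    "User message:\n{message}"
)


def is_flagged(message: str, ask_llm: Callable[[str], str]) -> bool:
    """Ask the model a yes/no question and flag the message on a 'yes' reply."""
    answer = ask_llm(PROMPT_TEMPLATE.format(message=message))
    # Treat any reply that starts with "yes" (case-insensitive) as a flag.
    return answer.strip().lower().startswith("yes")


def fake_llm(prompt: str) -> str:
    """Offline stand-in for a real model call (keyword heuristic, illustration only)."""
    return "Yes." if "explosive" in prompt.lower() else "No."


if __name__ == "__main__":
    print(is_flagged("How do I build an explosive device?", fake_llm))
    print(is_flagged("What is the capital of France?", fake_llm))
```

In practice `fake_llm` would be replaced by a call to an actual model API. The rest of the wrapper stays the same, which is part of why this baseline is so cheap to deploy.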
Abstract
Prior work on jailbreak detection has established the importance of adversarial robustness for LLMs but has largely focused on the model's ability to resist adversarial inputs and to output safe content, rather than on the effectiveness of external supervision systems. The only public and independent benchmark of these guardrails to date evaluates a narrow set of supervisors on limited scenarios. Consequently, no comprehensive public benchmark yet verifies how well supervision systems from the market perform under realistic, diverse attacks. To address this, we introduce BELLS, a Benchmark for the Evaluation of LLM Supervision Systems. The framework is two-dimensional, spanning harm severity (benign, borderline, harmful) and adversarial sophistication (direct vs. jailbreak), and provides a rich dataset covering 3 jailbreak families and 11 harm categories. Our evaluations reveal drastic limitations of…
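
The two-dimensional layout described in the abstract can be sketched as a grid of (harm severity, adversarial sophistication) cells, with a supervisor's flag rate computed per cell. The sketch below is an illustration under stated assumptions: the record format and the `flag_rates` helper are hypothetical, and the per-cell flag rate is a generic metric, not the paper's BELLS score.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Axis values as listed in the abstract.
SEVERITIES = ("benign", "borderline", "harmful")
SOPHISTICATION = ("direct", "jailbreak")

# Hypothetical record format: (severity, sophistication, flagged_by_supervisor)
Record = Tuple[str, str, bool]


def flag_rates(records: Iterable[Record]) -> Dict[Tuple[str, str], float]:
    """Return the fraction of prompts flagged in each (severity, sophistication) cell."""
    counts = defaultdict(lambda: [0, 0])  # cell -> [flagged, total]
    for severity, sophistication, flagged in records:
        if severity not in SEVERITIES or sophistication not in SOPHISTICATION:
            raise ValueError(f"unknown cell: {(severity, sophistication)}")
        cell = counts[(severity, sophistication)]
        cell[0] += int(flagged)
        cell[1] += 1
    return {key: flagged / total for key, (flagged, total) in counts.items()}


if __name__ == "__main__":
    toy_records = [
        ("harmful", "direct", True),
        ("harmful", "direct", True),
        ("harmful", "jailbreak", True),
        ("harmful", "jailbreak", False),
        ("benign", "direct", False),
    ]
    for cell, rate in sorted(flag_rates(toy_records).items()):
        print(cell, rate)
```

On harmful cells the flag rate reads as a detection rate; on benign cells the same number is a false-positive rate. This is why the grid keeps the two axes separate rather than reporting a single aggregate.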
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Advanced Malware Detection Techniques · Spam and Phishing Detection
