BELLS: A Framework Towards Future Proof Benchmarks for the Evaluation of LLM Safeguards
Diego Dorn, Alexandre Variengien, Charbel-Rapha\"el Segerie, Vincent, Corruble

TL;DR
BELLS introduces a comprehensive, structured benchmark suite for evaluating Large Language Model safeguards across established, emerging, and future architectures, addressing a critical gap in safety assessment methodology.
Contribution
The paper presents BELLS, the first structured benchmark collection for LLM safeguards, including tests for current, novel, and future model architectures, with an interactive visualization tool.
Findings
Provides a standardized framework for evaluating LLM safeguards.
Includes the first next-gen architecture test using MACHIAVELLI.
Encourages development of more general and adaptable safeguards.
Abstract
Input-output safeguards are used to detect anomalies in the traces produced by Large Language Models (LLMs) systems. These detectors are at the core of diverse safety-critical applications such as real-time monitoring, offline evaluation of traces, and content moderation. However, there is no widely recognized methodology to evaluate them. To fill this gap, we introduce the Benchmarks for the Evaluation of LLM Safeguards (BELLS), a structured collection of tests, organized into three categories: (1) established failure tests, based on already-existing benchmarks for well-defined failure modes, aiming to compare the performance of current input-output safeguards; (2) emerging failure tests, to measure generalization to never-seen-before failure modes and encourage the development of more general safeguards; (3) next-gen architecture tests, for more complex scaffolding (such as LLM-agents…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsNuclear and radioactivity studies
