BELLS: A Framework Towards Future Proof Benchmarks for the Evaluation of LLM Safeguards

Diego Dorn, Alexandre Variengien, Charbel-Raphaël Segerie, Vincent Corruble

arXiv:2406.01364·cs.CR·June 4, 2024·2 cites

TL;DR

BELLS is a structured collection of benchmarks for evaluating LLM input-output safeguards, organized into established failure tests, emerging failure tests, and next-gen architecture tests. It addresses the lack of a widely recognized methodology for evaluating such safeguards.

Contribution

The paper presents BELLS, a structured collection of tests for LLM input-output safeguards that covers established failure modes, emerging failure modes, and next-gen architectures. It also shares the first next-gen architecture test, together with an interactive visualization tool.

Findings

01

Provides a standardized framework for evaluating LLM safeguards.

02

Includes the first next-gen architecture test, built on the MACHIAVELLI environment.

03

Encourages development of more general and adaptable safeguards.

Abstract

Input-output safeguards are used to detect anomalies in the traces produced by Large Language Model (LLM) systems. These detectors are at the core of diverse safety-critical applications such as real-time monitoring, offline evaluation of traces, and content moderation. However, there is no widely recognized methodology to evaluate them. To fill this gap, we introduce the Benchmarks for the Evaluation of LLM Safeguards (BELLS), a structured collection of tests, organized into three categories: (1) established failure tests, based on already-existing benchmarks for well-defined failure modes, aiming to compare the performance of current input-output safeguards; (2) emerging failure tests, to measure generalization to never-seen-before failure modes and encourage the development of more general safeguards; (3) next-gen architecture tests, for more complex scaffolding (such as LLM-agents…
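To make the evaluation setting concrete, the following is a minimal sketch of how a safeguard could be scored on labeled traces, broken down by test category. It is not the BELLS code or its data format: the `Trace` class, the category labels, the `evaluate` function, and the toy `keyword_safeguard` are all hypothetical names introduced here for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass
class Trace:
    """Hypothetical trace record; not the actual BELLS schema."""
    category: str        # e.g. "established", "emerging", "next-gen"
    messages: List[str]  # text of the inputs/outputs in the trace
    is_anomalous: bool   # ground-truth label for the trace


# A safeguard is modeled as any function that flags a trace as anomalous.
Safeguard = Callable[[Trace], bool]


def evaluate(safeguard: Safeguard, traces: Iterable[Trace]) -> Dict[str, Dict[str, int]]:
    """Count true/false positives and negatives per test category."""
    counts: Dict[str, Dict[str, int]] = defaultdict(
        lambda: {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    )
    for trace in traces:
        flagged = safeguard(trace)
        if trace.is_anomalous:
            key = "tp" if flagged else "fn"
        else:
            key = "fp" if flagged else "tn"
        counts[trace.category][key] += 1
    return {category: dict(c) for category, c in counts.items()}


def keyword_safeguard(trace: Trace) -> bool:
    """Toy detector (assumption): flags one fixed prompt-injection phrase."""
    return any("ignore previous instructions" in m.lower() for m in trace.messages)
```

Reporting counts per category rather than a single aggregate score keeps the distinction the abstract draws: a detector tuned to a known failure mode may do well on established tests while missing an emerging one entirely.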
