ALERT: A Comprehensive Benchmark for Assessing Large Language Models' Safety through Red Teaming

Simone Tedeschi; Felix Friedrich; Patrick Schramowski; Kristian Kersting; Roberto Navigli; Huu Nguyen; Bo Li

arXiv:2404.08676·cs.CL·June 25, 2024·2 cites

TL;DR

ALERT is a large-scale benchmark designed to evaluate the safety of large language models through adversarial testing, using a detailed taxonomy to identify vulnerabilities and improve safety measures.

Contribution

The paper introduces ALERT, a comprehensive safety benchmark with a novel taxonomy for fine-grained risk assessment of LLMs, enabling in-depth safety evaluation and comparison.

Findings

01. Many LLMs still struggle to achieve acceptable safety levels.

02. The benchmark reveals specific vulnerabilities in popular models.

03. ALERT facilitates targeted improvements in LLM safety.

Abstract

When building Large Language Models (LLMs), it is paramount to bear safety in mind and protect them with guardrails. Indeed, LLMs should never generate content promoting or normalizing harmful, illegal, or unethical behavior that may contribute to harm to individuals or society. This principle applies to both normal and adversarial use. In response, we introduce ALERT, a large-scale benchmark to assess safety based on a novel fine-grained risk taxonomy. It is designed to evaluate the safety of LLMs through red teaming methodologies and consists of more than 45k instructions categorized using our novel taxonomy. By subjecting LLMs to adversarial testing scenarios, ALERT aims to identify vulnerabilities, inform improvements, and enhance the overall safety of the language models. Furthermore, the fine-grained taxonomy enables researchers to perform an in-depth evaluation that also helps…
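
The abstract describes instructions grouped under a fine-grained risk taxonomy, which suggests reporting safety per category as well as overall. The following is a minimal sketch of that kind of aggregation, not the authors' implementation: it assumes each model response has already been labeled safe or unsafe by some external judge, and the function names and category labels (`category_a`, `category_b`) are hypothetical placeholders rather than entries from ALERT's taxonomy.

```python
from collections import defaultdict


def per_category_safety(records):
    """Return the fraction of safe responses for each category.

    `records` is an iterable of (category, is_safe) pairs. How `is_safe`
    is decided is outside this sketch (assumption: an external judge).
    """
    totals = defaultdict(int)
    safe = defaultdict(int)
    for category, is_safe in records:
        totals[category] += 1
        if is_safe:
            safe[category] += 1
    return {category: safe[category] / totals[category] for category in totals}


def overall_safety(records):
    """Return the fraction of safe responses across all categories."""
    records = list(records)
    if not records:
        raise ValueError("no records to score")
    return sum(1 for _, is_safe in records if is_safe) / len(records)


# Hypothetical labeled responses; category names are placeholders.
example = [
    ("category_a", True),
    ("category_a", False),
    ("category_b", True),
    ("category_b", True),
]

scores = per_category_safety(example)
total = overall_safety(example)
```

Keeping the per-category breakdown next to the overall figure makes it possible to see a model that looks acceptable in aggregate but is weak in one specific risk area.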

Taxonomy

Topics: Natural Language Processing Techniques