ALERT: A Comprehensive Benchmark for Assessing Large Language Models' Safety through Red Teaming
Simone Tedeschi, Felix Friedrich, Patrick Schramowski, Kristian Kersting, Roberto Navigli, Huu Nguyen, Bo Li

TL;DR
ALERT is a large-scale benchmark designed to evaluate the safety of large language models through adversarial testing, using a detailed taxonomy to identify vulnerabilities and improve safety measures.
Contribution
The paper introduces ALERT, a comprehensive safety benchmark with a novel taxonomy for fine-grained risk assessment of LLMs, enabling in-depth safety evaluation and comparison.
Findings
Many LLMs still struggle to achieve acceptable safety levels.
The benchmark reveals specific vulnerabilities in popular models.
ALERT facilitates targeted improvements in LLM safety.
Abstract
When building Large Language Models (LLMs), it is paramount to bear safety in mind and protect them with guardrails. Indeed, LLMs should never generate content promoting or normalizing harmful, illegal, or unethical behavior that may contribute to harm to individuals or society. This principle applies to both normal and adversarial use. In response, we introduce ALERT, a large-scale benchmark to assess safety based on a novel fine-grained risk taxonomy. It is designed to evaluate the safety of LLMs through red teaming methodologies and consists of more than 45k instructions categorized using our novel taxonomy. By subjecting LLMs to adversarial testing scenarios, ALERT aims to identify vulnerabilities, inform improvements, and enhance the overall safety of the language models. Furthermore, the fine-grained taxonomy enables researchers to perform an in-depth evaluation that also helps…
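The abstract describes instructions grouped under a risk taxonomy so that safety can be assessed per category as well as overall. A minimal sketch of that kind of aggregation is shown here, assuming each model response has already been judged safe or unsafe by some external classifier. The function name `safety_scores`, the record layout, and the category labels are hypothetical and are not taken from the paper or its released code.

```python
from collections import defaultdict


def safety_scores(records):
    """Aggregate safe/unsafe judgments into per-category and overall scores.

    `records` is an iterable of (category, is_safe) pairs. This layout is an
    assumption made for illustration, not the benchmark's actual data format.
    """
    totals = defaultdict(int)  # number of judged responses per category
    safe = defaultdict(int)    # number of responses judged safe per category

    for category, is_safe in records:
        totals[category] += 1
        safe[category] += int(is_safe)

    # Fraction of safe responses within each category.
    per_category = {c: safe[c] / totals[c] for c in totals}

    # Fraction of safe responses across all categories combined.
    n = sum(totals.values())
    overall = sum(safe.values()) / n if n else 0.0
    return per_category, overall


# Hypothetical judgments for two placeholder categories.
example = [
    ("category_a", True),
    ("category_a", False),
    ("category_b", True),
    ("category_b", True),
]
per_category, overall = safety_scores(example)
print(per_category)
print(overall)
```

Reporting the per-category fractions alongside the overall score is what lets a weak category stand out even when the aggregate looks acceptable.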
Taxonomy
Topics: Natural Language Processing Techniques
