TL;DR
RedBench is a comprehensive, standardized dataset designed to evaluate large language models' robustness against adversarial prompts across multiple domains and risk categories.
Contribution
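RedBench unifies existing red teaming datasets under a standardized taxonomy of 22 risk categories and 19 domains, addressing the inconsistent risk categorizations and limited domain coverage of prior datasets and enabling systematic vulnerability assessment of LLMs.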
Findings
RedBench aggregates 37 benchmark datasets comprising 29,362 samples in total, spanning attack and refusal prompts.
It establishes baselines for modern LLMs.
The dataset and evaluation code are open-sourced.
Abstract
As large language models (LLMs) become integral to safety-critical applications, ensuring their robustness against adversarial prompts is paramount. However, existing red teaming datasets suffer from inconsistent risk categorizations, limited domain coverage, and outdated evaluations, hindering systematic vulnerability assessments. To address these challenges, we introduce RedBench, a universal dataset aggregating 37 benchmark datasets from leading conferences and repositories, comprising 29,362 samples across attack and refusal prompts. RedBench employs a standardized taxonomy with 22 risk categories and 19 domains, enabling consistent and comprehensive evaluations of LLM vulnerabilities. We provide a detailed analysis of existing datasets, establish baselines for modern LLMs, and open-source the dataset and evaluation code. Our contributions facilitate robust comparisons, foster…
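To make the idea of a standardized taxonomy concrete, here is a minimal sketch of how samples from differently labeled source datasets might be mapped onto one shared schema. Everything in it is an assumption for illustration: the `Sample` fields, the `LABEL_MAP` entries, the source names, and the category names are hypothetical and are not taken from RedBench's actual taxonomy or code.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical mapping from (source dataset, source label) to a unified
# risk category. These names are illustrative, not RedBench's taxonomy.
LABEL_MAP = {
    ("dataset_a", "hate"): "hate_speech",
    ("dataset_b", "toxic_hate"): "hate_speech",
    ("dataset_b", "malware"): "cybersecurity",
}

VALID_PROMPT_TYPES = {"attack", "refusal"}


@dataclass(frozen=True)
class Sample:
    """One prompt in the unified schema (assumed fields)."""
    prompt: str
    source: str
    risk_category: str
    prompt_type: str  # "attack" or "refusal"


def normalize(source: str, prompt: str, raw_label: str, prompt_type: str) -> Sample:
    """Map a source-specific record onto the unified schema."""
    if prompt_type not in VALID_PROMPT_TYPES:
        raise ValueError(f"unknown prompt type: {prompt_type!r}")
    # Labels without a mapping are kept but flagged, so they can be reviewed.
    category = LABEL_MAP.get((source, raw_label), "unmapped")
    return Sample(prompt, source, category, prompt_type)


def category_counts(samples: list[Sample]) -> Counter:
    """Count samples per unified risk category."""
    return Counter(s.risk_category for s in samples)
```

With a mapping like this, two datasets that use different names for the same risk end up in one category, which is what makes per-category comparisons across sources possible.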
