LLM Robustness Leaderboard v1 -- Technical report
Pierre Peigné-Lefebvre, Quentin Feuillade-Montixi, Tom David, Nicolas Miailhe

TL;DR
This report presents a comprehensive robustness evaluation of large language models using an automated red-teaming tool, revealing significant variability in vulnerability and proposing new metrics for detailed assessment.
Contribution
The report introduces the PRISM Eval Behavior Elicitation Tool (BET), an automated red-teaming system, together with a fine-grained robustness metric, advancing the methodology for evaluating LLM safety and robustness.
Findings
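Achieved a 100% Attack Success Rate (ASR) against 37 of 41 state-of-the-art LLMs.
Found that attack difficulty, measured as the estimated average number of attempts needed to elicit harmful behavior, varies by over 300-fold across models.
Identified which jailbreaking techniques are most effective for specific hazard categories.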
Abstract
This technical report accompanies the LLM robustness leaderboard published by PRISM Eval for the Paris AI Action Summit. We introduce PRISM Eval Behavior Elicitation Tool (BET), an AI system performing automated red-teaming through Dynamic Adversarial Optimization that achieves 100% Attack Success Rate (ASR) against 37 of 41 state-of-the-art LLMs. Beyond binary success metrics, we propose a fine-grained robustness metric estimating the average number of attempts required to elicit harmful behaviors, revealing that attack difficulty varies by over 300-fold across models despite universal vulnerability. We introduce primitive-level vulnerability analysis to identify which jailbreaking techniques are most effective for specific hazard categories. Our collaborative evaluation with trusted third parties from the AI Safety Network demonstrates practical pathways for distributed robustness…
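To make the distinction between binary ASR and an attempts-based metric concrete, here is a minimal Python sketch. It is not the report's estimator: the function names, the log format (one list of per-attempt booleans for each behavior), and the toy numbers are all assumptions made for illustration. It only shows how two models can share a 100% ASR while differing widely in the average number of attempts needed.

```python
from statistics import mean

# Hypothetical log format (an assumption, not taken from the report):
# {behavior_id: [outcome of attempt 1, outcome of attempt 2, ...]}
# where True means the attempt elicited the harmful behavior.

def attack_success_rate(outcomes):
    """Fraction of behaviors elicited by at least one attempt."""
    return sum(1 for attempts in outcomes.values() if any(attempts)) / len(outcomes)

def mean_attempts_to_success(outcomes):
    """Average 1-based index of the first successful attempt,
    taken over the behaviors that were eventually elicited."""
    firsts = [a.index(True) + 1 for a in outcomes.values() if any(a)]
    return mean(firsts) if firsts else float("inf")

# Toy data, invented for this example only.
model_a = {"b1": [True], "b2": [False, True]}
model_b = {"b1": [False] * 9 + [True], "b2": [False] * 19 + [True]}

for name, log in (("model_a", model_a), ("model_b", model_b)):
    print(name, attack_success_rate(log), mean_attempts_to_success(log))
```

In this toy data both models reach an ASR of 1.0, yet the second needs ten times as many attempts on average. A gap of that kind is what a binary success metric hides and what an attempts-based metric is meant to expose.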
