Introducing v0.5 of the AI Safety Benchmark from MLCommons

Bertie Vidgen; Adarsh Agrawal; Ahmed M. Ahmed; Victor Akinwande; Namir Al-Nuaimi; Najla Alfaraj; Elie Alhajjar; Lora Aroyo; Trupti Bavalatti; Max Bartolo; Borhane Blili-Hamelin; Kurt Bollacker; Rishi Bomassani; Marisa Ferrara Boston; Siméon Campos; Kal Chakra; Canyu Chen; Cody Coleman; Zacharie Delpierre Coudert; Leon Derczynski; Debojyoti Dutta; Ian Eisenberg; James Ezick; Heather Frase; Brian Fuller; Ram Gandikota; Agasthya Gangavarapu; Ananya Gangavarapu; James Gealy; Rajat Ghosh; James Goel; Usman Gohar; Sujata Goswami; Scott A. Hale; Wiebke Hutiri; Joseph Marvin Imperial; Surgan Jandial; Nick Judd; Felix Juefei-Xu; Foutse Khomh; Bhavya Kailkhura; Hannah Rose Kirk; Kevin Klyman; Chris Knotz; Michael Kuchnik; Shachi H. Kumar; Srijan Kumar; Chris Lengerich; Bo Li; Zeyi Liao; Eileen Peters Long; Victor Lu; Sarah Luger; Yifan Mai; Priyanka Mary Mammen; Kelvin Manyeki; Sean McGregor; Virendra Mehta; Shafee Mohammed; Emanuel Moss; Lama Nachman; Dinesh Jinenhally Naganna; Amin Nikanjam; Besmira Nushi; Luis Oala; Iftach Orr; Alicia Parrish; Cigdem Patlak; William Pietri; Forough Poursabzi-Sangdeh; Eleonora Presani; Fabrizio Puletti; Paul Röttger; Saurav Sahay; Tim Santos; Nino Scherrer; Alice Schoenauer Sebag; Patrick Schramowski; Abolfazl Shahbazi; Vin Sharma; Xudong Shen; Vamsi Sistla; Leonard Tang; Davide Testuggine; Vithursan Thangarasa; Elizabeth Anne Watkins; Rebecca Weiss; Chris Welty; Tyler Wilbers; Adina Williams; Carole-Jean Wu; Poonam Yadav; Xianjun Yang; Yi Zeng; Wenhui Zhang; Fedor Zhdanov; Jiacheng Zhu; Percy Liang; Peter Mattson; Joaquin Vanschoren

arXiv:2404.12241 · cs.CL · May 15, 2024 · 5 cites

Open Access 1 Repo 10 Models

TL;DR

This paper presents v0.5 of the MLCommons AI Safety Benchmark, which is designed to assess the safety risks of AI systems that use chat-tuned language models. The release comprises a hazard taxonomy, a set of test items, and an evaluation platform, and it lays the groundwork for the planned v1.0 benchmark.

Contribution

The paper introduces a principled approach to specifying and constructing the benchmark, a taxonomy of 13 hazard categories (7 of which have tests in v0.5), and an evaluation platform for assessing the safety of chat-tuned models. It also documents the benchmark's limitations.

Findings

01. 43,090 test items were created for safety assessment.
02. Over a dozen language models were evaluated against the benchmark.
03. The paper identifies key hazard categories and testing challenges.

Abstract

This paper introduces v0.5 of the AI Safety Benchmark, which has been created by the MLCommons AI Safety Working Group. The AI Safety Benchmark has been designed to assess the safety risks of AI systems that use chat-tuned language models. We introduce a principled approach to specifying and constructing the benchmark, which for v0.5 covers only a single use case (an adult chatting to a general-purpose assistant in English), and a limited set of personas (i.e., typical users, malicious users, and vulnerable users). We created a new taxonomy of 13 hazard categories, of which 7 have tests in the v0.5 benchmark. We plan to release version 1.0 of the AI Safety Benchmark by the end of 2024. The v1.0 benchmark will provide meaningful insights into the safety of AI systems. However, the v0.5 benchmark should not be used to assess the safety of AI systems. We have sought to fully document the…

Code & Models

Repositories

mlcommons/modelbench (Official)

Taxonomy

Topics: Adversarial Robustness in Machine Learning

Methods: Sparse Evolutionary Training