Benchmark Early and Red Team Often: A Framework for Assessing and Managing Dual-Use Hazards of AI Foundation Models

Anthony M. Barrett; Krystal Jackson; Evan R. Murphy; Nada Madkour; Jessica Newman

arXiv:2405.10986 · cs.CR · May 21, 2024 · 2 cites

TL;DR

This paper proposes a combined approach using open benchmarks and closed red team evaluations to assess and manage the dual-use risks of AI foundation models, aiming for effective, resource-aware risk mitigation.

Contribution

It introduces a framework that leverages both open and closed evaluation methods to better identify and manage dual-use hazards in AI models.

Findings

01

If benchmark scores correlate substantially with red team evaluation scores, as the authors expect, open benchmarks can serve as a low-cost predictor of dual-use potential.

02

Frequent use of open benchmarks can inform safer model development.

03

Red team evaluations provide detailed insights into high-risk models.
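
The workflow these findings imply can be sketched in a few lines: score models on an open benchmark, check how closely those scores track red team scores, and escalate only high-scoring models to the costlier closed evaluation. This is a minimal illustration under stated assumptions, not the authors' implementation. The function names, the threshold of 0.6, and all score values are hypothetical placeholders and are not results from the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def needs_red_team(benchmark_score, threshold):
    """Flag a model for closed red team evaluation when its open
    benchmark score reaches the (hypothetical) escalation threshold."""
    return benchmark_score >= threshold


# Hypothetical placeholder scores for five models; NOT data from the paper.
benchmark_scores = [0.20, 0.35, 0.50, 0.65, 0.80]
red_team_scores = [0.10, 0.30, 0.45, 0.70, 0.85]

# How well does the cheap open benchmark track the expensive red team result?
r = pearson(benchmark_scores, red_team_scores)

# Escalate only the models whose benchmark score crosses the threshold.
flagged = [s for s in benchmark_scores if needs_red_team(s, threshold=0.6)]

print(f"correlation: {r:.3f}")
print(f"models flagged for red teaming: {len(flagged)} of {len(benchmark_scores)}")
```

In practice the threshold would have to be calibrated against real paired benchmark and red team scores; the sketch only shows the shape of the triage step.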

Abstract

A concern about cutting-edge or "frontier" AI foundation models is that an adversary may use the models for preparing chemical, biological, radiological, nuclear (CBRN), cyber, or other attacks. At least two methods can identify foundation models with potential dual-use capability; each has advantages and disadvantages: (A) open benchmarks (based on openly available questions and answers), which are low-cost but accuracy-limited by the need to omit security-sensitive details; and (B) closed red team evaluations (based on private evaluation by CBRN and cyber experts), which are higher-cost but can achieve higher accuracy by incorporating sensitive details. We propose a research and risk-management approach that combines open benchmarks and closed red team evaluations in a way that leverages the advantages of both methods. We recommend that one or more groups…

Taxonomy

Topics: Occupational Health and Safety Research

Methods: Sparse Evolutionary Training