SafetyAnalyst: Interpretable, Transparent, and Steerable Safety Moderation for AI Behavior

Jing-Jing Li; Valentina Pyatkin; Max Kleiman-Weiner; Liwei Jiang; Nouha Dziri; Anne G. E. Collins; Jana Schaich Borg; Maarten Sap; Yejin Choi; Sydney Levine

arXiv:2410.16665 · cs.CL · May 29, 2025



TL;DR

SafetyAnalyst introduces an interpretable and steerable safety moderation framework for AI, utilizing structured harm-benefit analysis and aggregation to improve prompt safety classification over existing systems.

Contribution

The paper presents a novel, interpretable safety moderation method that uses chain-of-thought reasoning and harm-benefit trees to enhance AI safety alignment.

Findings

01. SafetyAnalyst achieves an average F1 of 0.81 on prompt safety classification benchmarks.

02. It outperforms existing moderation systems, which average an F1 below 0.72.

03. The framework offers interpretability, transparency, and steerability in safety moderation.

Abstract

The ideal AI safety moderation system would be both structurally interpretable (so its decisions can be reliably explained) and steerable (to align to safety standards and reflect a community's values), which current systems fall short on. To address this gap, we present SafetyAnalyst, a novel AI safety moderation framework. Given an AI behavior, SafetyAnalyst uses chain-of-thought reasoning to analyze its potential consequences by creating a structured "harm-benefit tree," which enumerates harmful and beneficial actions and effects the AI behavior may lead to, along with likelihood, severity, and immediacy labels that describe potential impacts on stakeholders. SafetyAnalyst then aggregates all effects into a harmfulness score using 28 fully interpretable weight parameters, which can be aligned to particular safety preferences. We applied this framework to develop an open-source LLM…
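The aggregation step described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the label vocabularies, the numeric weights, the multiplicative form of `effect_weight`, and the decision threshold are hypothetical and are not the paper's 28 fitted parameters or its actual aggregation formula. The sketch only shows the general idea of turning labeled harmful and beneficial effects into a single score through a small set of readable weights.

```python
from dataclasses import dataclass

# Hypothetical label vocabularies and weights (illustrative only; NOT the
# paper's parameters). Changing these values is what "steering" would mean
# in this toy version.
LIKELIHOOD_WEIGHT = {"low": 0.2, "medium": 0.5, "high": 1.0}
SEVERITY_WEIGHT = {"minor": 0.25, "significant": 0.5, "major": 1.0}
IMMEDIACY_WEIGHT = {"immediate": 1.0, "downstream": 0.5}


@dataclass(frozen=True)
class Effect:
    """One leaf of a toy harm-benefit tree."""
    description: str
    stakeholder: str
    is_harm: bool      # True for a harmful effect, False for a beneficial one
    likelihood: str    # key into LIKELIHOOD_WEIGHT
    severity: str      # key into SEVERITY_WEIGHT
    immediacy: str     # key into IMMEDIACY_WEIGHT


def effect_weight(effect: Effect) -> float:
    """Assumed form: product of the three label weights."""
    return (
        LIKELIHOOD_WEIGHT[effect.likelihood]
        * SEVERITY_WEIGHT[effect.severity]
        * IMMEDIACY_WEIGHT[effect.immediacy]
    )


def harmfulness_score(effects: list[Effect]) -> float:
    """Harms add to the score, benefits subtract from it."""
    return sum(
        effect_weight(e) if e.is_harm else -effect_weight(e) for e in effects
    )


def is_harmful(effects: list[Effect], threshold: float = 0.0) -> bool:
    """Classify as harmful when the score exceeds a (hypothetical) threshold."""
    return harmfulness_score(effects) > threshold


example_effects = [
    Effect("enables physical injury", "third parties", True,
           "high", "major", "immediate"),
    Effect("satisfies user curiosity", "user", False,
           "low", "minor", "downstream"),
]
example_score = harmfulness_score(example_effects)
```

Because every weight is a named, human-readable number, a score in this sketch can be traced back to the individual effects that produced it, and adjusting a weight table changes the classification behavior without retraining anything. That is the sense of interpretability and steerability this toy version tries to convey.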


Code & Models

Datasets

jl3676/SafetyAnalystData (dataset, 56 downloads)


Taxonomy

Topics: Risk and Safety Analysis · Safety Systems Engineering in Autonomy · Adversarial Robustness in Machine Learning

Methods: ALIGN · Sparse Evolutionary Training