SafetyAnalyst: Interpretable, Transparent, and Steerable Safety Moderation for AI Behavior
Jing-Jing Li, Valentina Pyatkin, Max Kleiman-Weiner, Liwei Jiang, Nouha Dziri, Anne G. E. Collins, Jana Schaich Borg, Maarten Sap, Yejin Choi, Sydney Levine

TL;DR
SafetyAnalyst is an interpretable and steerable safety moderation framework for AI that uses structured harm-benefit analysis and weighted aggregation to classify prompt safety more accurately than existing systems.
Contribution
The paper presents a novel, interpretable safety moderation method that uses chain-of-thought reasoning and harm-benefit trees to enhance AI safety alignment.
Findings
SafetyAnalyst achieves an average F1 of 0.81 on safety classification benchmarks.
It outperforms existing moderation systems, whose average F1 scores fall below 0.72.
The framework offers interpretability, transparency, and steerability in safety moderation.
Abstract
The ideal AI safety moderation system would be both structurally interpretable (so its decisions can be reliably explained) and steerable (to align to safety standards and reflect a community's values), which current systems fall short on. To address this gap, we present SafetyAnalyst, a novel AI safety moderation framework. Given an AI behavior, SafetyAnalyst uses chain-of-thought reasoning to analyze its potential consequences by creating a structured "harm-benefit tree," which enumerates harmful and beneficial actions and effects the AI behavior may lead to, along with likelihood, severity, and immediacy labels that describe potential impacts on stakeholders. SafetyAnalyst then aggregates all effects into a harmfulness score using 28 fully interpretable weight parameters, which can be aligned to particular safety preferences. We applied this framework to develop an open-source LLM…
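The aggregation step described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the label vocabularies, the weight values, the multiplicative combination rule, and the zero threshold below are all assumptions chosen for illustration, and they do not correspond to the 28 parameters the authors report. The sketch only shows the general idea of turning labeled harmful and beneficial effects into one score with adjustable, human-readable weights.

```python
from dataclasses import dataclass

# Hypothetical weights for illustration only (NOT the paper's 28 parameters).
DEFAULT_WEIGHTS = {
    "likelihood": {"low": 0.25, "medium": 0.5, "high": 1.0},
    "severity": {"minor": 0.25, "significant": 0.5, "major": 1.0},
    "immediacy": {"downstream": 0.5, "immediate": 1.0},
}


@dataclass(frozen=True)
class Effect:
    """One leaf of a (hypothetical) harm-benefit tree."""
    stakeholder: str
    description: str
    is_harm: bool      # True for a harmful effect, False for a beneficial one
    likelihood: str
    severity: str      # for benefits this plays the role of "extent"
    immediacy: str


def effect_weight(effect, weights=DEFAULT_WEIGHTS):
    # Assumed combination rule: multiply the three label weights.
    return (
        weights["likelihood"][effect.likelihood]
        * weights["severity"][effect.severity]
        * weights["immediacy"][effect.immediacy]
    )


def harmfulness_score(effects, weights=DEFAULT_WEIGHTS):
    # Harms add to the score, benefits subtract from it.
    harm = sum(effect_weight(e, weights) for e in effects if e.is_harm)
    benefit = sum(effect_weight(e, weights) for e in effects if not e.is_harm)
    return harm - benefit


def is_harmful(effects, weights=DEFAULT_WEIGHTS, threshold=0.0):
    # Assumed decision rule: flag the behavior when the score exceeds the threshold.
    return harmfulness_score(effects, weights) > threshold


example_tree = [
    Effect("user", "physical injury", True, "high", "major", "immediate"),
    Effect("user", "gains information", False, "medium", "significant", "downstream"),
]

print(harmfulness_score(example_tree))
print(is_harmful(example_tree))
```

Because every weight is a named, readable number, changing the moderation policy in this sketch means editing a table entry (for example, lowering the weight of "downstream" effects), which is the kind of steerability the abstract describes.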
Taxonomy
Topics: Risk and Safety Analysis · Safety Systems Engineering in Autonomy · Adversarial Robustness in Machine Learning
Methods: ALIGN · Sparse Evolutionary Training
