PluriHarms: Benchmarking the Full Spectrum of Human Judgments on AI Harm
Jing-Jing Li, Joel Mire, Eve Fleisig, Valentina Pyatkin, Anne Collins, Maarten Sap, Sydney Levine

TL;DR
PluriHarms is a comprehensive benchmark that captures the spectrum of human judgments on AI harm, emphasizing disagreement and diversity in values to improve AI safety models.
Contribution
It introduces a scalable framework and dataset for studying human harm judgments across disagreement and harm axes, incorporating diverse human traits and prompt features.
Findings
Prompts related to imminent risks are perceived as more harmful.
Annotator traits influence disagreement in harm judgments.
Personalization improves AI safety model predictions.
Abstract
Current AI safety frameworks, which often treat harmfulness as binary, lack the flexibility to handle borderline cases where humans meaningfully disagree. To build more pluralistic systems, it is essential to move beyond consensus and instead understand where and why disagreements arise. We introduce PluriHarms, a benchmark designed to systematically study human harm judgments across two key dimensions -- the harm axis (benign to harmful) and the agreement axis (agreement to disagreement). Our scalable framework generates prompts that capture diverse AI harms and human values while targeting cases with high disagreement rates, validated by human data. The benchmark includes 150 prompts with 15,000 ratings from 100 human annotators, enriched with demographic and psychological traits and prompt-level features of harmful actions, effects, and values. Our analyses show that prompts that…
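As a rough illustration of the two axes described in the abstract, the following Python sketch places prompts on a harm axis and an agreement axis using their ratings. The 0-to-1 rating scale, the choice of the mean for harm and the standard deviation for disagreement, the function name `summarize_prompt`, and the toy data are all assumptions made for illustration; they are not taken from the paper. The last lines check that 150 prompts rated by 100 annotators gives the reported 15,000 ratings, which is what a design where every annotator rates every prompt would produce.

```python
from statistics import mean, pstdev


def summarize_prompt(ratings):
    """Return (harm, disagreement) for one prompt.

    Assumption: ratings lie in [0, 1]; harm is the mean rating and
    disagreement is the population standard deviation across annotators.
    """
    return mean(ratings), pstdev(ratings)


# Hypothetical toy data: each list holds one prompt's ratings from four annotators.
toy_ratings = {
    "benign_consensus": [0.0, 0.0, 0.1, 0.0],
    "harmful_consensus": [1.0, 0.9, 1.0, 1.0],
    "contested": [0.0, 1.0, 0.1, 0.9],
}

summary = {name: summarize_prompt(r) for name, r in toy_ratings.items()}

for name, (harm, disagreement) in summary.items():
    print(f"{name}: harm={harm:.2f}, disagreement={disagreement:.2f}")

# Fully crossed design: every annotator rates every prompt.
n_prompts, n_annotators = 150, 100
total_ratings = n_prompts * n_annotators
print(total_ratings)
```

In this toy data the contested prompt sits in the middle of the harm axis but has far higher disagreement than either consensus prompt, which is the kind of borderline case the benchmark targets.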
Peer Reviews
Decision: ICLR 2026 Poster
Dataset (creation)
* The process is well-explained. They achieve broad coverage and controlled variation across harm levels.
* Through annotation with specialized safety rating models, they add semantic structure, linking prompts to human-understandable ethical dimensions.
* Each participant rated all prompts. This allows within-subject comparisons of how individual traits affect harm judgments.
* The correlation between synthetic harm levels and human ratings supports the validity of the synthe
Dataset (creation)
* The dataset is limited in size, covers only English prompts, and considers limited demographic diversity.
* The focus is on single prompts rather than (more realistic) full human–AI conversations/interactions.
* The reliability of the model-based feature extraction (SafetyAnalyst, KALEIDO) is assumed rather than validated; potential biases in those models could affect prompt selection and the dataset curation process.
* The paper does not contain any details on
- The inclusion of deep annotator profiles, incorporating psychological measures like the Moral Foundations Questionnaire and Schwartz Value Survey, and structured prompt features from SafetyAnalyst and Kaleido provides a useful resource for analysis. This rich data enables the paper's core investigations into the drivers of disagreement, moving beyond simple demographics to the underlying values and psychological traits of annotators.
- The paper provides strong empirical evidence for a central
- The paper's core methodological claim is the creation of a "calibrated" harm spectrum by prompting an LLM (DeepSeek) to generate variants along a 0.0 to 1.0 ordinal scale. This process implicitly assumes the LLM is a neutral tool for varying a single latent dimension of "harm." However, the LLM itself is a complex model that may introduce systematic stylistic artifacts (e.g., changes in syntax, vocabulary, or tone) that correlate with the requested harm level. These artifacts could act as conf
- Annotator demographics are included, and are relatively diverse across a number of demographic categories reported in Appendix D.3.
- The paper argues that disagreement should be treated as diverse, legitimate viewpoints, which is an important principle for pluralism.
- As a benchmark, PluriHarms is challenging; WildGuard only performs about as well as the random baseline. This demonstrates that there is still a gap for developing safety models that can be aligned to diverse viewpoints.
- This
- PluriHarms has only 150 prompts, which is a small number compared to other datasets such as AIR-Bench 2024 with 5,692 prompts.
- More information could have been provided in Section 2 about the genetic algorithm used to curate prompts.
- Annotator recruitment was done on Prolific, thus the annotator sample is subject to the biases of the annotator population on that platform, and may not generalize to other populations.
Taxonomy
Topics: Ethics and Social Impacts of AI · Hate Speech and Cyberbullying Detection · Psychology of Moral and Emotional Judgment
