PluriHarms: Benchmarking the Full Spectrum of Human Judgments on AI Harm

Jing-Jing Li; Joel Mire; Eve Fleisig; Valentina Pyatkin; Anne Collins; Maarten Sap; Sydney Levine

arXiv:2601.08951·cs.CY·February 4, 2026

Open Access · 3 Reviews

TL;DR

PluriHarms is a comprehensive benchmark that captures the spectrum of human judgments on AI harm, emphasizing disagreement and diversity in values to improve AI safety models.

Contribution

It introduces a scalable framework and dataset for studying human harm judgments across disagreement and harm axes, incorporating diverse human traits and prompt features.

Findings

01. Prompts related to imminent risks increase perceived harm.

02. Annotator traits influence disagreement in harm judgments.

03. Personalization improves AI safety model predictions.

Abstract

Current AI safety frameworks, which often treat harmfulness as binary, lack the flexibility to handle borderline cases where humans meaningfully disagree. To build more pluralistic systems, it is essential to move beyond consensus and instead understand where and why disagreements arise. We introduce PluriHarms, a benchmark designed to systematically study human harm judgments across two key dimensions -- the harm axis (benign to harmful) and the agreement axis (agreement to disagreement). Our scalable framework generates prompts that capture diverse AI harms and human values while targeting cases with high disagreement rates, validated by human data. The benchmark includes 150 prompts with 15,000 ratings from 100 human annotators, enriched with demographic and psychological traits and prompt-level features of harmful actions, effects, and values. Our analyses show that prompts that…
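The two axes described in the abstract can be illustrated with a small sketch. It is not the paper's analysis code: the rating scale (0 to 1), the toy ratings, and the names `summarize_prompts` and `toy_ratings` are assumptions made for illustration, and the population standard deviation is used only as a simple stand-in for whatever disagreement measure the authors actually adopt.

```python
import statistics

# Figures stated in the abstract: 150 prompts, 100 annotators, 15,000 ratings.
# The product matches a fully crossed design in which every annotator rates every prompt.
N_PROMPTS, N_ANNOTATORS = 150, 100
TOTAL_RATINGS = N_PROMPTS * N_ANNOTATORS


def summarize_prompts(ratings):
    """Map each prompt id to (harm, disagreement).

    harm         = mean rating across annotators (harm axis)
    disagreement = population standard deviation of the ratings
                   (an assumed proxy for the agreement axis)
    """
    return {
        prompt_id: (statistics.mean(values), statistics.pstdev(values))
        for prompt_id, values in ratings.items()
    }


# Hypothetical ratings on an assumed 0-1 scale, four annotators per prompt.
toy_ratings = {
    "benign_consensus": [0.0, 0.1, 0.0, 0.1],
    "harmful_consensus": [0.9, 1.0, 1.0, 0.9],
    "contested": [0.0, 1.0, 0.1, 0.9],
}

summary = summarize_prompts(toy_ratings)
for prompt_id, (harm, disagreement) in summary.items():
    print(f"{prompt_id}: harm={harm:.2f} disagreement={disagreement:.2f}")
```

In this toy data the contested prompt sits in the middle of the harm axis but far along the disagreement axis, which is the kind of borderline case a binary harmful/benign label cannot represent.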

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

Dataset (creation)
* The process is well-explained. They achieve broad coverage and controlled variation across harm levels.
* Through annotation with specialized safety rating models, they add semantic structure, linking prompts to human-understandable ethical dimensions.
* Each participant rated all prompts. This allows within-subject comparisons of how individual traits affect harm judgments.
* The correlation between synthetic harm levels and human ratings supports the validity of the synthe

Weaknesses

Dataset (creation)
* The dataset is limited in size, covers only English prompts, and has limited demographic diversity.
* The focus is on single prompts rather than (more realistic) full human–AI conversations/interactions.
* The reliability of the model-based feature extraction (SafetyAnalyst, KALEIDO) is assumed rather than validated; potential biases in those models could affect prompt selection and the dataset curation process.
* The paper does not contain any details on

Reviewer 02 · Rating 4 · Confidence 3

Strengths

- The inclusion of deep annotator profiles, incorporating psychological measures like the Moral Foundations Questionnaire and Schwartz Value Survey, and structured prompt features from SafetyAnalyst and Kaleido provides a useful resource for analysis. This rich data enables the paper's core investigations into the drivers of disagreement, moving beyond simple demographics to the underlying values and psychological traits of annotators.
- The paper provides strong empirical evidence for a central

Weaknesses

- The paper's core methodological claim is the creation of a "calibrated" harm spectrum by prompting an LLM (DeepSeek) to generate variants along a 0.0 to 1.0 ordinal scale. This process implicitly assumes the LLM is a neutral tool for varying a single latent dimension of "harm." However, the LLM itself is a complex model that may introduce systematic stylistic artifacts (e.g., changes in syntax, vocabulary, or tone) that correlate with the requested harm level. These artifacts could act as conf

Reviewer 03 · Rating 8 · Confidence 4

Strengths

- Annotator demographics are included, and are relatively diverse across a number of demographic categories reported in Appendix D.3.
- The paper argues that disagreement should be treated as diverse, legitimate viewpoints, which is an important principle for pluralism.
- As a benchmark, PluriHarms is challenging; WildGuard only performs about as well as the random baseline. This demonstrates that there is still a gap for developing safety models that can be aligned to diverse viewpoints.
- This

Weaknesses

- PluriHarms has only 150 prompts, which is a small number compared to other datasets such as AIR-Bench 2024 with 5,692 prompts.
- More information could have been provided in Section 2 about the genetic algorithm used to curate prompts.
- Annotator recruitment was done on Prolific, thus the annotator sample is subject to the biases of the annotator population on that platform, and may not generalize to other populations.

Taxonomy

Topics: Ethics and Social Impacts of AI · Hate Speech and Cyberbullying Detection · Psychology of Moral and Emotional Judgment