Understanding Annotator Safety Policy with Interpretability
Alex Oesterling, Donghao Ren, Yannick Assogba, Dominik Moritz, Sunnie S.Y. Kim, Leon Gatys, Fred Hohman

TL;DR
This paper introduces Annotator Policy Models (APMs), interpretable models that infer annotators' underlying safety policies from their labeling behavior, helping to identify the sources of annotation disagreement and to improve safety policy design.
Contribution
The paper presents APMs that accurately model annotator safety policies from behavior alone, enabling analysis of ambiguity and value pluralism without extra annotation effort.
Findings
APMs achieve over 80% accuracy in modeling safety policies.
APMs can predict responses to counterfactual safety policy edits.
APMs uncover systematic differences in safety priorities across demographic groups.
Abstract
Safety policies define what constitutes safe and unsafe AI outputs, guiding data annotation and model development. However, annotation disagreement is pervasive and can stem from multiple sources such as operational failures (annotators misunderstand or misexecute the task), policy ambiguity (policy wording leaves room for interpretation), or value pluralism (different annotators hold different perspectives on safety). Distinguishing these sources matters. For example, operational failures call for quality control, ambiguity calls for policy clarification, and pluralism calls for deliberation about incorporating diverse perspectives. Yet understanding why annotators disagree is difficult. Directly asking annotators for their reasoning is costly, substantially increasing annotation burden, and can be unreliable for both human and LLM annotators as self-reported reasoning often fails to…
