AI vs. Human Judgment of Content Moderation: LLM-as-a-Judge and Ethics-Based Response Refusals
Stefan Pasch

TL;DR
This paper compares how LLM-based judges and human users evaluate content moderation refusals. It finds that the models rate ethical refusals more favorably than humans do, which points to a potential bias in automated evaluation systems.
Contribution
It introduces the concept of moderation bias, showing that LLM-based evaluators systematically rate ethical refusals more favorably than humans, and analyzes implications for AI safety and evaluation practices.
Findings
LLM judges rate ethical refusals more positively than human users do.
A systematic moderation bias favors ethical refusals in model evaluations.
No comparable difference is observed for technical refusals.
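The comparison behind these findings can be illustrated with a small sketch. This is not the paper's code or methodology, which this summary does not describe; the `Battle` record, the `win_rate` and `moderation_gap` helpers, and the toy rows are all hypothetical and exist only to show what "rating refusals more favorably than humans" means as a number.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Battle:
    """One pairwise comparison involving a response of interest (hypothetical schema)."""
    refusal_type: str    # "ethical" or "technical" (label of the response under study)
    human_prefers: bool  # True if the human voter preferred that response
    judge_prefers: bool  # True if the LLM judge preferred that response


def win_rate(battles: List[Battle], refusal_type: str, rater: str) -> Optional[float]:
    """Share of battles of one refusal type that the given rater awarded to the refusal."""
    subset = [b for b in battles if b.refusal_type == refusal_type]
    if not subset:
        return None
    return sum(getattr(b, rater) for b in subset) / len(subset)


def moderation_gap(battles: List[Battle], refusal_type: str) -> Optional[float]:
    """Judge win rate minus human win rate; a positive gap means the judge is more favorable."""
    human = win_rate(battles, refusal_type, "human_prefers")
    judge = win_rate(battles, refusal_type, "judge_prefers")
    if human is None or judge is None:
        return None
    return judge - human


# Made-up toy rows for illustration only; these are not results from the paper.
toy = [
    Battle("ethical", False, True),
    Battle("ethical", False, True),
    Battle("ethical", True, True),
    Battle("ethical", False, False),
    Battle("technical", True, True),
    Battle("technical", False, False),
]

ethical_gap = moderation_gap(toy, "ethical")
technical_gap = moderation_gap(toy, "technical")
```

In this toy setup, a positive gap for ethical refusals alongside a gap near zero for technical refusals would mirror the pattern the findings describe. A real analysis would also need significance testing and controls, which this sketch omits.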
Abstract
As large language models (LLMs) are increasingly deployed in high-stakes settings, their ability to refuse ethically sensitive prompts, such as those involving hate speech or illegal activities, has become central to content moderation and responsible AI practices. While refusal responses can be viewed as evidence of ethical alignment and safety-conscious behavior, recent research suggests that users may perceive them negatively. At the same time, automated assessments of model outputs are playing a growing role in both evaluation and training. In particular, LLM-as-a-Judge frameworks, in which one model is used to evaluate the output of another, are now widely adopted to guide benchmarking and fine-tuning. This paper examines whether such model-based evaluators assess refusal responses differently than human users. Drawing on data from Chatbot Arena and judgments from two AI judges (GPT-4o…
Taxonomy
Topics: Law, AI, and Intellectual Property · Legal Education and Practice Innovations
Methods: LLaMA
