Towards Safer AI Moderation: Evaluating LLM Moderators Through a Unified Benchmark Dataset and Advocating a Human-First Approach
Naseem Machlovi, Maryam Saleki, Innocent Ababio, Ruhul Amin

TL;DR
This paper introduces a comprehensive benchmark dataset and a fine-tuned LLM, SafePhi, to evaluate and improve AI moderation, emphasizing a human-first approach for safer and more reliable content moderation.
Contribution
The paper presents a unified benchmark dataset spanning 49 categories and SafePhi, a fine-tuned LLM that outperforms existing moderators. It also highlights the importance of human-in-the-loop strategies.
Findings
SafePhi achieved a Macro F1 score of 0.89, surpassing the existing moderators it was compared against.
LLM moderators underperform in nuanced moral and bias detection.
Heterogeneous data and human oversight are crucial for robustness.
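For readers unfamiliar with the headline metric: Macro F1 is the unweighted mean of per-category F1 scores, so a rare category counts as much as a common one. The sketch below is a minimal pure-Python illustration of that definition. The function name `macro_f1`, the three labels, and the toy predictions are all hypothetical and are not taken from the paper's dataset or evaluation code.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (single-label classification)."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data (hypothetical labels): the moderator misses the only "bias" item.
y_true = ["safe", "safe", "safe", "safe", "hate", "bias"]
y_pred = ["safe", "safe", "safe", "safe", "hate", "safe"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy = {accuracy:.2f}, macro F1 = {macro_f1(y_true, y_pred):.2f}")
```

In this toy case accuracy is about 0.83 while Macro F1 is about 0.63, because the missed "bias" category contributes an F1 of 0 to the average. That sensitivity to small categories is one reason a macro-averaged score suits a benchmark with many categories of uneven size.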
Abstract
As AI systems become more integrated into daily life, the need for safer and more reliable moderation has never been greater. Large Language Models (LLMs) have demonstrated remarkable capabilities, surpassing earlier models in complexity and performance. Their evaluation across diverse tasks has consistently showcased their potential, enabling the development of adaptive and personalized agents. However, despite these advancements, LLMs remain prone to errors, particularly in areas requiring nuanced moral reasoning. They struggle with detecting implicit hate, offensive language, and gender biases due to the subjective and context-dependent nature of these issues. Moreover, their reliance on training data can inadvertently reinforce societal biases, leading to inconsistencies and ethical concerns in their outputs. To explore the limitations of LLMs in this role, we developed an…
