A Multi-Perspective Benchmark and Moderation Model for Evaluating Safety and Adversarial Robustness
Naseem Machlovi, Maryam Saleki, Ruhul Amin, Mohamed Rahouti, Shawqi Al-Maliki, Junaid Qadir, Mohamed M. Abdallah, Ala Al-Fuqaha

TL;DR
This paper introduces GuardEval, a comprehensive benchmark dataset, and GGuard, a fine-tuned moderation model, to improve safety and robustness of language models against nuanced and adversarial content.
Contribution
The paper presents GuardEval, a new multi-perspective safety benchmark, and GGuard, a fine-tuned moderation model that significantly outperforms existing systems in detecting nuanced unsafe content.
Findings
GGuard achieves a macro F1 score of 0.832, outperforming existing models.
Multi-perspective benchmarks improve moderation consistency.
Diverse data enhances safety and adversarial robustness.
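For readers unfamiliar with the metric behind the headline result, the following is a minimal sketch of how a macro F1 score is computed: the F1 score is calculated separately for each class and then averaged with equal weight, so rare classes count as much as common ones. The function name `macro_f1` and the "safe"/"unsafe" labels are illustrative assumptions, not taken from the paper, and this is not the authors' evaluation code.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (illustrative sketch)."""
    labels = sorted(set(y_true) | set(y_pred))
    per_class_f1 = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        # Count true positives, false positives, and false negatives for this class.
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        per_class_f1.append(f1)
    # Every class contributes equally, regardless of how many examples it has.
    return sum(per_class_f1) / len(per_class_f1)


# Hypothetical toy labels, for illustration only.
y_true = ["safe", "safe", "unsafe", "unsafe"]
y_pred = ["safe", "unsafe", "unsafe", "unsafe"]
score = macro_f1(y_true, y_pred)
```

In this toy case the "safe" class has an F1 of 2/3 and the "unsafe" class has an F1 of 0.8, so the macro average is about 0.733. Because the average is unweighted, a benchmark with many fine-grained categories rewards models that perform consistently across all of them rather than only on the most frequent ones.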
Abstract
As large language models (LLMs) become deeply embedded in daily life, the urgent need for safer moderation systems that distinguish between naive and harmful requests while upholding appropriate censorship boundaries has never been greater. While existing LLMs can detect dangerous or unsafe content, they often struggle with nuanced cases such as implicit offensiveness, subtle gender and racial biases, and jailbreak prompts, due to the subjective and context-dependent nature of these issues. Furthermore, their heavy reliance on training data can reinforce societal biases, resulting in inconsistent and ethically problematic outputs. To address these challenges, we introduce GuardEval, a unified multi-perspective benchmark dataset designed for both training and evaluation, containing 106 fine-grained categories spanning human emotions, offensive and hateful language, gender and racial…
Taxonomy
Topics: Hate Speech and Cyberbullying Detection · Adversarial Robustness in Machine Learning · Ethics and Social Impacts of AI
