A Multi-Perspective Benchmark and Moderation Model for Evaluating Safety and Adversarial Robustness

Naseem Machlovi; Maryam Saleki; Ruhul Amin; Mohamed Rahouti; Shawqi Al-Maliki; Junaid Qadir; Mohamed M. Abdallah; Ala Al-Fuqaha

arXiv:2601.03273 · cs.CL · March 23, 2026

Open Access

TL;DR

This paper introduces GuardEval, a comprehensive benchmark dataset, and GGuard, a fine-tuned moderation model, to improve safety and robustness of language models against nuanced and adversarial content.

Contribution

The paper presents GuardEval, a new multi-perspective safety benchmark, and GGuard, a fine-tuned moderation model that is reported to outperform existing moderation systems in detecting nuanced unsafe content.

Findings

01 GGuard achieves a macro F1 score of 0.832, outperforming existing models.

02 Multi-perspective benchmarks improve moderation consistency.

03 Diverse data enhances safety and adversarial robustness.
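
The macro F1 score cited in the first finding is the unweighted mean of per-class F1 scores, so rare classes count as much as common ones. The sketch below shows how such a score is typically computed for single-label classification. It is an illustration only: the function name `macro_f1` and the toy labels are assumptions, not taken from the paper, and the paper's exact evaluation protocol is not described in this summary.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal length")

    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy labels, not data from GuardEval.
y_true = ["safe", "safe", "safe", "unsafe"]
y_pred = ["safe", "safe", "unsafe", "unsafe"]
score = macro_f1(y_true, y_pred)
print(round(score, 3))
```

In this toy case the per-class F1 values are 0.8 for "safe" and about 0.667 for "unsafe", so the macro average falls below the plain accuracy of 0.75. That gap is why macro F1 is a common choice when some categories are much rarer than others.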

Abstract

As large language models (LLMs) become deeply embedded in daily life, the urgent need for safer moderation systems that distinguish between naive and harmful requests while upholding appropriate censorship boundaries has never been greater. While existing LLMs can detect dangerous or unsafe content, they often struggle with nuanced cases such as implicit offensiveness, subtle gender and racial biases, and jailbreak prompts, due to the subjective and context-dependent nature of these issues. Furthermore, their heavy reliance on training data can reinforce societal biases, resulting in inconsistent and ethically problematic outputs. To address these challenges, we introduce GuardEval, a unified multi-perspective benchmark dataset designed for both training and evaluation, containing 106 fine-grained categories spanning human emotions, offensive and hateful language, gender and racial…

Taxonomy

Topics: Hate Speech and Cyberbullying Detection · Adversarial Robustness in Machine Learning · Ethics and Social Impacts of AI