Watching the AI Watchdogs: A Fairness and Robustness Analysis of AI Safety Moderation Classifiers
Akshit Achara, Anshuman Chhabra

TL;DR
This paper evaluates the fairness and robustness of four widely used AI safety moderation classifiers, revealing potential disparities across user groups and sensitivity to input changes that could undermine their reliability in content moderation.
Contribution
It provides a comprehensive analysis of fairness and robustness issues in four widely used ASM classifiers and highlights areas for improvement.
Findings
Potential fairness gaps identified across classifiers.
Robustness varies with input perturbations.
Differences observed between classifiers and baseline models.
Abstract
AI Safety Moderation (ASM) classifiers are designed to moderate content on social media platforms and to serve as guardrails that prevent Large Language Models (LLMs) from being fine-tuned on unsafe inputs. Owing to their potential for disparate impact, it is crucial to ensure that these classifiers (1) do not unfairly classify content belonging to users from minority groups as unsafe compared to content from majority groups and (2) behave robustly and consistently across similar inputs. In this work, we thus examine the fairness and robustness of four widely used, closed-source ASM classifiers: OpenAI Moderation API, Perspective API, Google Cloud Natural Language (GCNL) API, and Clarifai API. We assess fairness using metrics such as demographic parity and conditional statistical parity, comparing their performance against ASM models and a fair-only baseline.…
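As a rough illustration of the two fairness metrics named in the abstract, the sketch below computes a demographic parity gap (the spread in "unsafe" rates across groups) and a conditional statistical parity gap (the same spread, computed within each level of a conditioning attribute). The paper's exact formulation is not given here, so the function names, the record layout, the choice of text length as the conditioning attribute, and the toy data are all assumptions made for illustration only.

```python
from collections import defaultdict


def unsafe_rate(flags):
    """Fraction of items flagged unsafe (0.0 for an empty list)."""
    return sum(flags) / len(flags) if flags else 0.0


def demographic_parity_gap(records):
    """Largest difference in unsafe rate between any two groups.

    `records` is an iterable of (group, stratum, predicted_unsafe) tuples.
    This layout is hypothetical, not taken from the paper.
    """
    by_group = defaultdict(list)
    for group, _stratum, unsafe in records:
        by_group[group].append(int(unsafe))
    rates = [unsafe_rate(flags) for flags in by_group.values()]
    return max(rates) - min(rates)


def conditional_parity_gaps(records):
    """Demographic parity gap computed separately within each stratum."""
    by_stratum = defaultdict(list)
    for record in records:
        by_stratum[record[1]].append(record)
    return {s: demographic_parity_gap(rs) for s, rs in by_stratum.items()}


# Toy data (invented): two groups, conditioned on text length.
records = [
    ("A", "short", 1), ("A", "short", 0), ("A", "long", 1), ("A", "long", 1),
    ("B", "short", 0), ("B", "short", 0), ("B", "long", 1), ("B", "long", 0),
]

dp_gap = demographic_parity_gap(records)
csp_gaps = conditional_parity_gaps(records)
print(dp_gap)    # overall gap between groups A and B
print(csp_gaps)  # gap within each length stratum
```

A gap of zero would mean every group is flagged unsafe at the same rate; in practice, the predictions would come from the classifier under test rather than from hand-written labels.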
Taxonomy
Topics: Ethics and Social Impacts of AI · Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI)
