Watching the AI Watchdogs: A Fairness and Robustness Analysis of AI Safety Moderation Classifiers

Akshit Achara; Anshuman Chhabra

arXiv:2501.13302·cs.CL·January 24, 2025

TL;DR

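This paper evaluates the fairness and robustness of four widely used, closed-source AI safety moderation classifiers (OpenAI Moderation API, Perspective API, Google Cloud Natural Language API, and Clarifai API), reporting potential group disparities and sensitivity to input perturbations that could undermine their reliability in content moderation.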

Contribution

It analyzes fairness and robustness issues in widely used AI Safety Moderation (ASM) classifiers and highlights areas for improvement.

Findings

01 Potential fairness gaps identified across classifiers.

02 Robustness varies with input perturbations.

03 Differences observed between classifiers and baseline models.

Abstract

AI Safety Moderation (ASM) classifiers are designed to moderate content on social media platforms and to serve as guardrails that prevent Large Language Models (LLMs) from being fine-tuned on unsafe inputs. Owing to their potential for disparate impact, it is crucial to ensure that these classifiers: (1) do not unfairly classify content belonging to users from minority groups as unsafe compared to those from majority groups and (2) that their behavior remains robust and consistent across similar inputs. In this work, we thus examine the fairness and robustness of four widely-used, closed-source ASM classifiers: OpenAI Moderation API, Perspective API, Google Cloud Natural Language (GCNL) API, and Clarifai API. We assess fairness using metrics such as demographic parity and conditional statistical parity, comparing their performance against ASM models and a fair-only baseline.…
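The two fairness metrics named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy data, and the choice of a max-minus-min rate gap are assumptions made for illustration. Demographic parity compares the rate at which content is flagged unsafe across groups; conditional statistical parity makes the same comparison within strata of a legitimate conditioning attribute (a hypothetical text-length bucket here).

```python
from collections import defaultdict


def unsafe_rates(flags, groups):
    """Fraction of items flagged unsafe (1) within each group."""
    counts = defaultdict(lambda: [0, 0])  # group -> [flagged, total]
    for flag, group in zip(flags, groups):
        counts[group][0] += int(flag)
        counts[group][1] += 1
    return {g: flagged / total for g, (flagged, total) in counts.items()}


def demographic_parity_gap(flags, groups):
    """Largest difference in unsafe rate between any two groups (0 = parity)."""
    rates = unsafe_rates(flags, groups)
    return max(rates.values()) - min(rates.values())


def conditional_parity_gaps(flags, groups, strata):
    """Demographic parity gap computed separately inside each stratum."""
    by_stratum = defaultdict(lambda: ([], []))  # stratum -> (flags, groups)
    for flag, group, stratum in zip(flags, groups, strata):
        by_stratum[stratum][0].append(flag)
        by_stratum[stratum][1].append(group)
    return {s: demographic_parity_gap(f, g) for s, (f, g) in by_stratum.items()}


# Toy data (made up): 1 = classifier flagged the text as unsafe.
flags = [1, 0, 1, 1, 0, 0, 1, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
strata = ["short", "short", "long", "long", "short", "short", "long", "long"]

print(unsafe_rates(flags, groups))
print(demographic_parity_gap(flags, groups))
print(conditional_parity_gaps(flags, groups, strata))
```

A gap of 0 means every group is flagged at the same rate; larger gaps indicate that one group's content is flagged unsafe more often than another's.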

Code & Models

Repositories

acharaakshit/fairmod (Official)

Videos

Watching the AI Watchdogs: A Fairness and Robustness Analysis of AI Safety Moderation Classifiers · underline

Taxonomy

Topics: Ethics and Social Impacts of AI · Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI)