Debiasing Text Safety Classifiers through a Fairness-Aware Ensemble

Olivia Sturman; Aparna Joshi; Bhaktipriya Radharapu; Piyush Kumar; Renee Shelby

arXiv:2409.13705·cs.CL·October 23, 2024

TL;DR

This paper introduces a lightweight, post-processing ensemble method to reduce societal biases in text safety classifiers, improving fairness with minimal performance loss.

Contribution

It proposes a novel ensemble-based debiasing technique combined with counterfactual fairness metrics and new balanced datasets for safer language model outputs.

Findings

01 Enhanced counterfactual fairness in classifiers

02 Minimal impact on safety classifier performance

03 Effective bias mitigation using ensemble and FDW

Abstract

The increasing use of large language models (LLMs) demands performant guardrails to ensure the safety of LLM inputs and outputs. When these safeguards are trained on imbalanced data, they can learn societal biases. We present a lightweight, post-processing method for improving counterfactual fairness in closed-source text safety classifiers. Our approach involves building an ensemble that not only outperforms the input classifiers and aligns them with policy, but also acts as a debiasing regularizer. We introduce two threshold-agnostic metrics to assess the counterfactual fairness of a model, and demonstrate how combining these metrics with Fair Data Reweighting (FDW) helps mitigate biases. We create an expanded OpenAI dataset, and a new templated LLM-generated dataset based on user prompts, both of which are counterfactually balanced across identity groups and cover four key areas of…
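
The abstract does not define its two threshold-agnostic metrics, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes a hypothetical metric (the mean per-template spread of scores across identity-swapped variants) and a hypothetical logistic combination of base classifier scores as the ensemble. The function names, weights, and array shapes are all assumptions made for illustration.

```python
import numpy as np


def counterfactual_score_gap(scores):
    """Hypothetical threshold-agnostic fairness measure (an assumption,
    not the paper's metric).

    scores: array of shape (n_templates, n_identity_groups), where each row
    holds classifier scores for one template filled with different identity
    terms. Returns the mean per-template range (max - min); 0 means every
    counterfactual variant receives the same score.
    """
    scores = np.asarray(scores, dtype=float)
    return float(np.mean(scores.max(axis=1) - scores.min(axis=1)))


def ensemble_scores(base_scores, weights, bias=0.0):
    """Hypothetical ensemble: a logistic combination of base classifier scores.

    base_scores: array of shape (n_classifiers, n_templates, n_identity_groups).
    weights: array of shape (n_classifiers,), assumed to be learned elsewhere.
    """
    base_scores = np.asarray(base_scores, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # Weighted sum over the classifier axis, then a sigmoid.
    logits = np.tensordot(weights, base_scores, axes=1) + bias
    return 1.0 / (1.0 + np.exp(-logits))


if __name__ == "__main__":
    # Two made-up base classifiers scored on two templates x two identity groups.
    base = np.array([
        [[0.2, 0.8], [0.5, 0.5]],
        [[0.6, 0.4], [0.5, 0.5]],
    ])
    combined = ensemble_scores(base, weights=[1.0, 1.0], bias=-1.0)
    print("gap, classifier 0:", counterfactual_score_gap(base[0]))
    print("gap, ensemble:    ", counterfactual_score_gap(combined))
```

Because the gap is computed directly on scores, it does not depend on any decision threshold. In a reweighting scheme, a measure of this kind could be used to upweight the identity groups or templates with the largest gaps when fitting the ensemble weights, though the page does not confirm how FDW does this.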

Code & Models

Repositories

google-deepmind/counterfactual_fairness_evaluation_dataset (Official)

Taxonomy

Topics: Blood donation and transfusion practices · Hate Speech and Cyberbullying Detection · Spam and Phishing Detection