Fair and Calibrated Toxicity Detection with Robust Training and Abstention

Mokshit Surana

arXiv:2605.14074·cs.LG·May 15, 2026

TL;DR

This paper evaluates fairness in toxicity detection along three axes (ranking, calibration, and abstention), shows that training-time interventions determine how well post-hoc safety mechanisms work, and proposes a multi-axis fairness framework for evaluating those mechanisms.

Contribution

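The paper systematically compares training interventions (ERM, instance-level reweighting, Group DRO) and post-hoc methods (temperature scaling, confidence-based abstention, per-identity threshold optimization) across the fairness axes, highlights their limitations, and proposes a multi-axis fairness approach.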

Findings

01 Calibration disparity is a hidden fairness violation.

02 Training interventions reshape disparity rather than eliminate it.

03 Post-hoc methods inherit training failure modes.

Abstract

Fairness in toxicity classification involves three integrated axes: ranking, calibration, and abstention. Training-time interventions and post-hoc safety mechanisms cannot be evaluated independently because the former determines the efficacy of the latter. We compare Empirical Risk Minimization (ERM), instance-level reweighting, and Group DRO across these axes, combined with temperature scaling, confidence-based abstention, and per-identity threshold optimization. Evaluation uses subgroup AUC, BPSN/BNSP AUC, error gaps, and per-subgroup Expected Calibration Error (ECE) with bootstrap CIs ($n = 1000$). We report four findings. (1) Calibration disparity is a hidden fairness violation. ERM has near-perfect aggregate calibration ($0.013$) but is significantly miscalibrated across all identity subgroups ($+0.029$ to $+0.134$). (2) Training interventions reshape rather than eliminate…
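To make the per-subgroup calibration metric concrete, here is a minimal sketch of how ECE could be computed overall and within each identity subgroup. It is not the paper's implementation. The equal-width binning, the default of 10 bins, the choice to compare the mean predicted toxicity probability with the observed toxic rate in each bin, and the names `expected_calibration_error`, `subgroup_ece`, and `group_masks` are all assumptions made for illustration.

```python
import numpy as np


def expected_calibration_error(probs, labels, n_bins=10):
    """ECE for a binary classifier (assumed variant: equal-width bins over P(toxic))."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges map each probability to a bin index in [0, n_bins - 1].
    bin_ids = np.clip(np.digitize(probs, edges[1:-1]), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            # Gap between mean predicted probability and observed positive rate,
            # weighted by the fraction of examples that fall in this bin.
            gap = abs(probs[mask].mean() - labels[mask].mean())
            ece += mask.mean() * gap
    return ece


def subgroup_ece(probs, labels, group_masks, n_bins=10):
    """Overall ECE plus ECE restricted to each subgroup (hypothetical helper).

    group_masks maps a subgroup name to a boolean array marking the
    examples that mention that identity.
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    result = {"overall": expected_calibration_error(probs, labels, n_bins)}
    for name, mask in group_masks.items():
        mask = np.asarray(mask, dtype=bool)
        result[name] = expected_calibration_error(probs[mask], labels[mask], n_bins)
    return result
```

Comparing each subgroup's value with the `"overall"` entry shows why an aggregate number can hide disparity: bins that are well calibrated on average may still be miscalibrated within a particular subgroup. Bootstrap confidence intervals of the kind the abstract mentions could be obtained by resampling example indices with replacement and recomputing these values, though the exact resampling scheme used in the paper is not stated here.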
