Fair and Calibrated Toxicity Detection with Robust Training and Abstention
Mokshit Surana

TL;DR
This paper evaluates fairness in toxicity detection along three axes (ranking, calibration, and abstention), shows that subgroup calibration disparities interact with abstention and other post-hoc methods, and proposes a multi-axis fairness framework for evaluating safety mechanisms.
Contribution
It systematically compares training interventions and post-hoc methods across fairness axes, highlighting their limitations and proposing a multi-axis fairness approach.
Findings
Calibration disparity is a hidden fairness violation.
Training interventions reshape disparity rather than eliminate it.
Post-hoc methods inherit training failure modes.
Abstract
Fairness in toxicity classification involves three integrated axes: ranking, calibration, and abstention. Training-time interventions and post-hoc safety mechanisms cannot be evaluated independently because the former determines the efficacy of the latter. We compare Empirical Risk Minimization (ERM), instance-level reweighting, and Group DRO across these axes, combined with temperature scaling, confidence-based abstention, and per-identity threshold optimization. Evaluation uses subgroup AUC, BPSN/BNSP AUC, error gaps, and per-subgroup Expected Calibration Error (ECE) with bootstrap CIs (). We report four findings. (1) Calibration disparity is a hidden fairness violation. ERM has near-perfect aggregate calibration () but is significantly miscalibrated across all identity subgroups ( to ). (2) Training interventions reshape rather than eliminate…
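To make finding (1) concrete, here is a minimal sketch of how per-subgroup ECE can differ from aggregate ECE. It is not the paper's implementation. It assumes a common binary-classification form of ECE (equal-width bins over the predicted toxicity probability, comparing mean predicted probability with the observed positive rate). The function names, the bin count, and the toy data are all hypothetical.

```python
from typing import Dict, List, Sequence


def expected_calibration_error(
    probs: Sequence[float], labels: Sequence[int], n_bins: int = 10
) -> float:
    """Binned ECE: weighted mean of |observed positive rate - mean predicted prob|."""
    if len(probs) != len(labels):
        raise ValueError("probs and labels must have the same length")
    if len(probs) == 0:
        raise ValueError("cannot compute ECE on an empty sample")

    counts = [0] * n_bins
    sum_probs = [0.0] * n_bins
    sum_labels = [0.0] * n_bins
    for p, y in zip(probs, labels):
        # Equal-width bins on [0, 1]; p == 1.0 falls into the last bin.
        b = min(int(p * n_bins), n_bins - 1)
        counts[b] += 1
        sum_probs[b] += p
        sum_labels[b] += y

    n = len(probs)
    ece = 0.0
    for b in range(n_bins):
        if counts[b] > 0:
            gap = abs(sum_labels[b] / counts[b] - sum_probs[b] / counts[b])
            ece += (counts[b] / n) * gap
    return ece


def subgroup_ece(
    probs: Sequence[float],
    labels: Sequence[int],
    membership: Dict[str, Sequence[bool]],
    n_bins: int = 10,
) -> Dict[str, float]:
    """ECE computed separately on the examples belonging to each identity subgroup."""
    result: Dict[str, float] = {}
    for name, mask in membership.items():
        sub_probs: List[float] = [p for p, m in zip(probs, mask) if m]
        sub_labels: List[int] = [y for y, m in zip(labels, mask) if m]
        result[name] = expected_calibration_error(sub_probs, sub_labels, n_bins)
    return result


# Toy data (hypothetical): the model always predicts 0.5, but every example in
# subgroup "a" is toxic and every example in subgroup "b" is not.
toy_probs = [0.5, 0.5, 0.5, 0.5]
toy_labels = [1, 1, 0, 0]
toy_membership = {
    "a": [True, True, False, False],
    "b": [False, False, True, True],
}

aggregate = expected_calibration_error(toy_probs, toy_labels)
per_group = subgroup_ece(toy_probs, toy_labels, toy_membership)
print(aggregate, per_group)
```

In this toy case the aggregate ECE is zero while both subgroups have an ECE of 0.5: over-confidence on one subgroup cancels under-confidence on the other once they are pooled. This cancellation is how a model can look calibrated in aggregate and still be miscalibrated on every subgroup.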
