When to Invoke: Refining LLM Fairness with Toxicity Assessment

Jing Ren; Bowen Li; Ziqi Xu; Renqiang Luo; Shuo Yu; Xin Ye; Haytham Fayek; Xiaodong Li; Feng Xia

arXiv:2601.09250·cs.CL·January 15, 2026

Open Access

TL;DR

This paper introduces FairToT, an inference-time framework that improves fairness in LLM toxicity assessments by detecting and refining cases with demographic biases, without altering the underlying model.

Contribution

FairToT is a prompt-guided, inference-time approach that uses two interpretable fairness indicators to detect and reduce demographic bias in toxicity assessment, without retraining the model.

Findings

1. Reduces group-level disparities in toxicity judgments.

2. Maintains stable and reliable toxicity predictions.

3. Improves fairness on benchmark datasets.

Abstract

Large Language Models (LLMs) are increasingly used for toxicity assessment in online moderation systems, where fairness across demographic groups is essential for equitable treatment. However, LLMs often produce inconsistent toxicity judgements for subtle expressions, particularly those involving implicit hate speech, revealing underlying biases that are difficult to correct through standard training. This raises a key question that existing approaches often overlook: when should corrective mechanisms be invoked to ensure fair and reliable assessments? To address this, we propose FairToT, an inference-time framework that enhances LLM fairness through prompt-guided toxicity assessment. FairToT identifies cases where demographic-related variation is likely to occur and determines when additional assessment should be applied. In addition, we introduce two interpretable fairness indicators…
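
The "when to invoke" question can be illustrated with a small gating sketch. This is not the paper's method: the abstract is truncated before the fairness indicators are defined, so the max-min disparity measure, the `threshold` value, the `{group}` template convention, and the names `disparity`, `should_invoke`, and `toy_score` are all assumptions made for illustration. The idea shown is only that a toxicity scorer is queried once per demographic variant of the same text, and extra assessment is triggered when the scores diverge.

```python
from typing import Callable, Sequence


def disparity(scores: Sequence[float]) -> float:
    """Gap between the highest and lowest toxicity score (assumed indicator)."""
    return max(scores) - min(scores)


def should_invoke(
    template: str,
    groups: Sequence[str],
    score_fn: Callable[[str], float],
    threshold: float = 0.2,  # hypothetical cut-off, not taken from the paper
) -> bool:
    """Return True when scores vary across group variants by more than threshold."""
    # Fill the same template with each group name and score every variant.
    scores = [score_fn(template.format(group=g)) for g in groups]
    return disparity(scores) > threshold


def toy_score(text: str) -> float:
    """Stand-in for an LLM toxicity call; deliberately biased for the demo."""
    return 0.8 if "group_b" in text else 0.4


if __name__ == "__main__":
    template = "People from {group} are always like that."
    print(should_invoke(template, ["group_a", "group_b"], toy_score))
```

In a real pipeline `score_fn` would wrap a prompt to the model under test, and a `True` result would route the input to an additional assessment step rather than accepting the first judgment.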

Taxonomy

Topics: Hate Speech and Cyberbullying Detection · Adversarial Robustness in Machine Learning · Topic Modeling