ToxiTrace: Gradient-Aligned Training for Explainable Chinese Toxicity Detection

Boyang Li; Hongzhe Shou; Yuanyuan Liang; Jingbin Zhang; Fang Zhou

arXiv:2604.12321·cs.CL·April 15, 2026

TL;DR

ToxiTrace is an explainability-oriented method for Chinese toxicity detection. It improves toxic span identification and produces human-readable explanations, and the authors report that it also improves classification accuracy.

Contribution

The paper introduces a gradient-aligned training framework with three components (CuSA, GCLoss, and ARCL) that improves toxic span extraction and explanation quality in BERT-style encoders.

Findings

01

Improves toxic span extraction accuracy.

02

Enhances the coherence and readability of explanations.

03

Maintains efficient encoder-based inference.
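As a rough illustration of what "toxic span extraction" from token-level scores can look like, the sketch below thresholds per-token saliency and merges nearby runs into contiguous spans. It is not the paper's procedure: the function name `extract_spans`, the `threshold` and `max_gap` parameters, and the example scores are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple


def extract_spans(
    scores: List[float],
    threshold: float = 0.5,   # hypothetical cutoff, not taken from the paper
    max_gap: int = 0,         # hypothetical: low-saliency tokens allowed inside a span
) -> List[Tuple[int, int]]:
    """Return half-open (start, end) token spans whose saliency meets the threshold.

    Runs separated by at most `max_gap` low-saliency tokens are merged,
    which favors contiguous, readable spans over scattered single tokens.
    """
    # Step 1: collect maximal runs of tokens at or above the threshold.
    runs: List[Tuple[int, int]] = []
    start: Optional[int] = None
    for i, score in enumerate(scores):
        if score >= threshold:
            if start is None:
                start = i
        elif start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(scores)))

    # Step 2: merge runs whose gap is small enough.
    merged: List[Tuple[int, int]] = []
    for run in runs:
        if merged and run[0] - merged[-1][1] <= max_gap:
            merged[-1] = (merged[-1][0], run[1])
        else:
            merged.append(run)
    return merged


# Illustrative scores only; real scores would come from an encoder's saliency map.
example_scores = [0.1, 0.9, 0.8, 0.2, 0.7, 0.1]
strict_spans = extract_spans(example_scores)
merged_spans = extract_spans(example_scores, max_gap=1)
```

With `max_gap=0` the two high-saliency runs stay separate; allowing a gap of one token joins them into a single span, which is one simple way to trade precision for contiguity.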

Abstract

Existing Chinese toxic content detection methods mainly target sentence-level classification but often fail to provide readable and contiguous toxic evidence spans. We propose ToxiTrace, an explainability-oriented method for BERT-style encoders with three components: (1) CuSA, which refines encoder-derived saliency cues into fine-grained toxic spans with lightweight LLM guidance; (2) GCLoss, a gradient-constrained objective that concentrates token-level saliency on toxic evidence while suppressing irrelevant activations; and (3) ARCL, which constructs sample-specific contrastive reasoning pairs to sharpen the semantic boundary between toxic and non-toxic content. Experiments show that ToxiTrace improves classification accuracy and toxic span extraction while preserving efficient encoder-based inference and producing more coherent, human-readable…
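The abstract does not give the form of GCLoss, so the following is only a minimal sketch of the general idea it describes: penalizing saliency mass that falls outside annotated toxic tokens. The function name `off_span_saliency`, the use of absolute values, and the normalization by total saliency are all assumptions, not the paper's definition.

```python
from typing import Sequence


def off_span_saliency(
    saliency: Sequence[float],   # per-token saliency, e.g. gradient magnitudes (assumed input)
    toxic_mask: Sequence[int],   # 1 for tokens inside a toxic span, 0 otherwise
    eps: float = 1e-12,
) -> float:
    """Fraction of total saliency magnitude that lies outside the toxic span.

    Returns a value in [0, 1]: 0.0 means all saliency sits on toxic tokens,
    1.0 means none of it does. A training objective could add this as a
    penalty term; that usage is an assumption, not the paper's stated loss.
    """
    if len(saliency) != len(toxic_mask):
        raise ValueError("saliency and toxic_mask must have the same length")

    magnitudes = [abs(s) for s in saliency]
    total = sum(magnitudes)
    if total <= eps:
        # No saliency anywhere, so there is nothing to penalize.
        return 0.0

    outside = sum(m for m, t in zip(magnitudes, toxic_mask) if not t)
    return outside / total


# Illustrative values only: one quarter of the saliency falls off the toxic token.
penalty = off_span_saliency([1.0, 3.0, 0.0, 0.0], [0, 1, 0, 0])
```

In a real encoder the saliency vector would be computed from gradients with an autodiff framework, and the penalty would have to be differentiable; this plain-Python version only shows the quantity being measured.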

Code & Models

Repositories

https://huggingface.co/ArdLi/ToxiTrace
