Beyond Accuracy: An Explainability-Driven Analysis of Harmful Content Detection

Trishita Dhara; Siddhesh Sheth

arXiv:2603.18015·cs.CL·March 20, 2026

TL;DR

This paper analyzes a neural harmful content detection model using explainability methods to uncover limitations and failure modes, emphasizing the importance of interpretability for human-in-the-loop moderation beyond mere accuracy metrics.

Contribution

It demonstrates how post-hoc explainability methods reveal model limitations and failure modes in harmful content detection, advocating for transparency over performance metrics.

Findings

01

Integrated Gradients provide diffuse contextual attributions.

02

Shapley Additive Explanations focus on explicit lexical cues.

03

Explainability exposes failure modes like indirect toxicity and political bias.
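
The two attribution methods named in the findings can be sketched on a toy model. The example below is illustrative only and makes several assumptions: the four-token vocabulary, the weights, and the logistic scorer are hypothetical stand-ins, not the paper's RoBERTa classifier, and the Shapley values are computed by exact enumeration rather than by the approximations a SHAP library would use on a real model. It shows that both methods distribute the same total (the score of the input minus the score of an all-zeros baseline) across tokens, while the per-token shares can differ.

```python
import itertools
import math

import numpy as np

# Hypothetical toy scorer: logistic model over binary token-presence features.
# Tokens, weights, and bias are made up for illustration.
TOKENS = ["you", "people", "are", "idiots"]
WEIGHTS = np.array([0.2, 0.4, 0.1, 2.5])
BIAS = -1.5


def score(x):
    """Probability-like 'harmful' score for a feature vector x."""
    return 1.0 / (1.0 + np.exp(-(x @ WEIGHTS + BIAS)))


def score_grad(x):
    """Analytic gradient of the logistic score with respect to x."""
    p = score(x)
    return p * (1.0 - p) * WEIGHTS


def integrated_gradients(x, baseline, steps=200):
    """Midpoint Riemann-sum approximation of Integrated Gradients."""
    alphas = (np.arange(steps) + 0.5) / steps
    grads = np.array([score_grad(baseline + a * (x - baseline)) for a in alphas])
    return (x - baseline) * grads.mean(axis=0)


def exact_shapley(x, baseline):
    """Exact Shapley values; 'absent' features take their baseline value."""
    n = len(x)
    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            # Shapley weight for coalitions of size k that exclude feature i.
            weight = math.factorial(k) * math.factorial(n - k - 1) / math.factorial(n)
            for subset in itertools.combinations(others, k):
                without_i = baseline.copy()
                for j in subset:
                    without_i[j] = x[j]
                with_i = without_i.copy()
                with_i[i] = x[i]
                phi[i] += weight * (score(with_i) - score(without_i))
    return phi


if __name__ == "__main__":
    x = np.ones(len(TOKENS))       # all four tokens present
    baseline = np.zeros(len(TOKENS))  # empty input as the reference point
    ig = integrated_gradients(x, baseline)
    shapley = exact_shapley(x, baseline)
    for token, a, b in zip(TOKENS, ig, shapley):
        print(f"{token:>8}  IG={a:.4f}  Shapley={b:.4f}")
    print("score(x) - score(baseline) =", score(x) - score(baseline))
```

Exact enumeration grows exponentially with the number of features, so it is only practical for a toy input like this one. On a transformer classifier, both methods would typically be applied through an attribution library and over token embeddings rather than binary features.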

Abstract

Although automated harmful content detection systems are widely used to monitor online platforms, moderators and end users often cannot understand the logic underlying their predictions. While recent studies have focused on increasing classification accuracy, little attention has been paid to understanding why neural models identify content as harmful, especially in borderline, contextual, and politically sensitive cases. This work presents an explainability-driven analysis of a neural harmful content detection model trained on the Civil Comments dataset. Two popular post-hoc explanation methods, Shapley Additive Explanations and Integrated Gradients, are used to analyze the behavior of a RoBERTa-based classifier on both correct predictions and systematic failure cases. Despite strong overall performance, with an area under the curve of 0.93 and an accuracy of 0.94,…

Taxonomy

Topics: Hate Speech and Cyberbullying Detection · Misinformation and Its Impacts · Explainable Artificial Intelligence (XAI)