Measuring and Mitigating Toxicity in Large Language Models: A Comprehensive Replication Study

Mokshit Surana; Archit Rathod; Akshaj Satishkumar

arXiv:2605.14087·cs.CL·May 18, 2026

TL;DR

This study evaluates DExperts, an inference-time technique for reducing toxicity in large language models. It finds the method highly effective against explicit toxicity but weaker against implicit hate speech, and it reports significant latency costs.

Contribution

It provides a comprehensive empirical assessment of DExperts, highlighting its strengths and limitations in mitigating both explicit and implicit toxicity in LLMs.

Findings

1. DExperts achieves 100% safety on explicit toxicity benchmarks.

2. The method's safety drops to 98.5% against implicit hate speech.

3. Latency increases tenfold, impacting real-time deployment.

Abstract

Large Language Models (LLMs) trained on web-scale corpora inherently absorb toxic patterns from their training data. This leads to toxic degeneration, where even innocuous prompts can trigger harmful outputs. The phenomenon poses significant risks for real-world deployments and necessitates effective mitigation strategies that maintain model utility while ensuring safety. In this comprehensive replication study, we evaluate the efficacy of DExperts (Decoding-time Experts), an inference-time mitigation technique that steers generation without requiring model retraining. We structured our research into three systematic phases: (1) establishing baseline toxicity measurements using RealToxicityPrompts on standard GPT-2 models; (2) implementing and evaluating DExperts to mitigate explicit toxicity; and (3) stress-testing the method against implicit hate…
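To make "steers generation without requiring model retraining" concrete, here is a minimal sketch of the decoding-time combination rule as formulated in the original DExperts work: at each step the base model's next-token logits are shifted by the difference between a non-toxic expert and a toxic anti-expert, scaled by a strength parameter alpha. The toy vocabulary, the logit values, and the function names below are illustrative assumptions; this is not the replication study's code, and its exact configuration may differ.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dexperts_combine(base_logits, expert_logits, antiexpert_logits, alpha=2.0):
    """Shift base logits by alpha * (expert - anti-expert), per token.

    Hypothetical helper name; alpha=2.0 is an illustrative default only.
    """
    if not (len(base_logits) == len(expert_logits) == len(antiexpert_logits)):
        raise ValueError("all three logit vectors must cover the same vocabulary")
    return [
        b + alpha * (e - a)
        for b, e, a in zip(base_logits, expert_logits, antiexpert_logits)
    ]


# Toy three-token vocabulary with made-up logits (assumption for illustration).
vocab = ["hello", "friend", "idiot"]
base = [1.0, 1.0, 1.0]      # base model is indifferent between tokens
expert = [1.0, 1.0, -1.0]   # non-toxic expert disfavours "idiot"
anti = [0.0, 0.0, 2.0]      # toxic anti-expert favours "idiot"

steered = dexperts_combine(base, expert, anti, alpha=2.0)
base_probs = softmax(base)
steered_probs = softmax(steered)

for token, before, after in zip(vocab, base_probs, steered_probs):
    print(f"{token:>7}: base={before:.4f} steered={after:.4f}")
```

Because three models are evaluated at every decoding step in this scheme, some added latency is expected, which is consistent in direction with the latency cost reported in the findings.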
