Measuring and Mitigating Toxicity in Large Language Models: A Comprehensive Replication Study
Mokshit Surana, Archit Rathod, Akshaj Satishkumar

TL;DR
This study evaluates DExperts, an inference-time technique for reducing toxicity in large language models. It finds the method highly effective against explicit toxicity, but weaker against implicit hate speech and costly in latency.
Contribution
It provides a comprehensive empirical assessment of DExperts, highlighting its strengths and limitations in mitigating both explicit and implicit toxicity in LLMs.
Findings
DExperts achieves 100% safety on explicit toxicity benchmarks.
The method's safety drops to 98.5% against implicit hate speech.
Latency increases tenfold, impacting real-time deployment.
Abstract
Large Language Models (LLMs) trained on web-scale corpora inherently absorb toxic patterns from their training data. This leads to toxic degeneration, in which even innocuous prompts can trigger harmful outputs. The phenomenon poses significant risks for real-world deployments and necessitates effective mitigation strategies that maintain model utility while ensuring safety. In this comprehensive replication study, we evaluate the efficacy of DExperts (Decoding-time Experts), an inference-time mitigation technique that steers generation without requiring model retraining. We structured our research into three systematic phases: (1) establishing baseline toxicity measurements using RealToxicityPrompts on standard GPT-2 models; (2) implementing and evaluating DExperts to mitigate explicit toxicity; and (3) stress-testing the method against implicit hate…
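As a rough illustration of how decoding-time steering of this kind can work, the sketch below combines next-token logits from a base model, a non-toxic "expert", and a toxic "anti-expert" as base + alpha * (expert - anti-expert), which is the combination rule described in the original DExperts work. The toy vocabulary, the logit values, the function names, and alpha = 2.0 are assumptions made for illustration only; they are not taken from this replication study.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dexperts_combine(base_logits, expert_logits, antiexpert_logits, alpha=2.0):
    """Steer base logits toward the expert and away from the anti-expert.

    alpha is a hypothetical steering strength; alpha = 0 recovers the base model.
    """
    return [
        b + alpha * (e - a)
        for b, e, a in zip(base_logits, expert_logits, antiexpert_logits)
    ]


# Toy three-token vocabulary with made-up logits (illustrative only).
vocab = ["kind", "neutral", "toxic_token"]
base = [1.0, 1.0, 1.5]      # base LM slightly prefers the toxic token
expert = [1.2, 1.0, 0.2]    # non-toxic expert disfavors it
anti = [0.5, 1.0, 2.5]      # toxic anti-expert strongly favors it

base_probs = softmax(base)
steered_probs = softmax(dexperts_combine(base, expert, anti, alpha=2.0))

for token, p_base, p_steered in zip(vocab, base_probs, steered_probs):
    print(f"{token:12s} base={p_base:.3f} steered={p_steered:.3f}")
```

In a scheme like this, each decoding step needs a forward pass through three models rather than one, which is one plausible contributor to added inference latency.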
