Let the Models Respond: Interpreting Language Model Detoxification Through the Lens of Prompt Dependence
Daniel Scalena, Gabriele Sarti, Malvina Nissim, Elisabetta Fersini

TL;DR
This paper investigates how different detoxification techniques affect language models' reliance on prompts, revealing differences in internal processes despite similar safety performance.
Contribution
It introduces a framework to analyze the internal effects of detoxification methods on language models using feature attribution.
Findings
Counter-narrative fine-tuning reduces prompt dependence more than reinforcement learning-driven detoxification does.
Detoxification methods influence models' internal decision processes.
Models show similar safety improvements despite different internal impacts.
Abstract
Due to language models' propensity to generate toxic or hateful responses, several techniques were developed to align model generations with users' preferences. Despite the effectiveness of such methods in improving the safety of model interactions, their impact on models' internal processes is still poorly understood. In this work, we apply popular detoxification approaches to several language models and quantify their impact on the resulting models' prompt dependence using feature attribution methods. We evaluate the effectiveness of counter-narrative fine-tuning and compare it with reinforcement learning-driven detoxification, observing differences in prompt reliance between the two methods despite their similar detoxification performances.
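To make the idea of "prompt dependence" concrete, here is a minimal sketch of one way such a quantity could be computed from feature attribution scores. It is an illustration under stated assumptions, not the metric the paper defines: the function name `prompt_dependence`, the row-per-generated-token layout of the scores, and the use of absolute attribution mass are all hypothetical choices made for this example. The attribution scores themselves would come from whatever attribution method is applied to the model, and the numbers below are toy values, not results.

```python
from typing import Sequence


def prompt_dependence(
    attributions: Sequence[Sequence[float]], prompt_len: int
) -> float:
    """Average share of absolute attribution mass that falls on prompt tokens.

    Assumed layout (hypothetical): attributions[t][i] is the attribution of
    context token i to generated token t, where the context is the prompt
    (first `prompt_len` positions) followed by previously generated tokens.
    """
    shares = []
    for step_scores in attributions:
        total = sum(abs(s) for s in step_scores)
        if total == 0:
            # Skip steps with no attribution mass to avoid dividing by zero.
            continue
        prompt_mass = sum(abs(s) for s in step_scores[:prompt_len])
        shares.append(prompt_mass / total)
    if not shares:
        raise ValueError("no generation step has non-zero attribution mass")
    return sum(shares) / len(shares)


# Toy scores for a 2-token prompt and 2 generated tokens (illustrative only).
toy_scores = [
    [0.5, 0.5],          # step 1: only the prompt is in context
    [0.25, 0.25, 0.5],   # step 2: prompt plus one generated token
]
score = prompt_dependence(toy_scores, prompt_len=2)
```

Under these assumptions, a value near 1 means the generation leans almost entirely on the prompt, while a lower value means more of the attribution falls on previously generated tokens. Computing the same quantity for a base model and for its detoxified variants would give one simple way to compare how each method shifts prompt reliance.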
Taxonomy
Topics: Topic Modeling · Explainable Artificial Intelligence (XAI)
Methods: ALIGN
