Let the Models Respond: Interpreting Language Model Detoxification Through the Lens of Prompt Dependence

Daniel Scalena; Gabriele Sarti; Malvina Nissim; Elisabetta Fersini

arXiv:2309.00751·cs.CL·September 6, 2023

TL;DR

This paper investigates how different detoxification techniques affect language models' reliance on prompts, revealing differences in internal processes despite similar safety performance.

Contribution

It introduces a framework to analyze the internal effects of detoxification methods on language models using feature attribution.

Findings

1. Counter-narrative fine-tuning reduces prompt dependence more than reinforcement learning-driven detoxification.
2. Detoxification changes how much models rely on the prompt during generation, as measured with feature attribution.
3. Both approaches reach similar detoxification performance despite their different effects on prompt reliance.

Abstract

Due to language models' propensity to generate toxic or hateful responses, several techniques have been developed to align model generations with users' preferences. Despite the effectiveness of such methods in improving the safety of model interactions, their impact on models' internal processes is still poorly understood. In this work, we apply popular detoxification approaches to several language models and quantify their impact on the resulting models' prompt dependence using feature attribution methods. We evaluate the effectiveness of counter-narrative fine-tuning and compare it with reinforcement learning-driven detoxification, observing differences in prompt reliance between the two methods despite their similar detoxification performance.
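
One way to make "prompt dependence" concrete is as the share of attribution mass that falls on prompt tokens for each generated token. The sketch below illustrates that idea only; it is not the paper's implementation. The function names, the use of absolute scores, and the toy values are all assumptions, and the attribution scores are assumed to come from any feature attribution method that assigns one score per context token at each generation step.

```python
from typing import List, Sequence


def prompt_dependence(attributions: Sequence[Sequence[float]], prompt_len: int) -> List[float]:
    """Per generated token, the fraction of absolute attribution mass on the prompt.

    attributions[t] holds one score per context token at generation step t,
    where the context is the prompt followed by previously generated tokens.
    This definition is a hypothetical illustration, not the paper's metric.
    """
    shares = []
    for row in attributions:
        total = sum(abs(a) for a in row)
        prompt_mass = sum(abs(a) for a in row[:prompt_len])
        # Guard against an all-zero attribution row.
        shares.append(prompt_mass / total if total > 0 else 0.0)
    return shares


def mean_prompt_dependence(attributions: Sequence[Sequence[float]], prompt_len: int) -> float:
    """Average prompt share over all generated tokens (0.0 if there are none)."""
    shares = prompt_dependence(attributions, prompt_len)
    return sum(shares) / len(shares) if shares else 0.0


# Toy example: a 2-token prompt and 2 generated tokens (made-up scores).
toy_scores = [
    [0.5, 0.5],         # step 0: only the prompt is in context
    [0.25, 0.25, 0.5],  # step 1: prompt plus one generated token
]
per_token = prompt_dependence(toy_scores, prompt_len=2)
average = mean_prompt_dependence(toy_scores, prompt_len=2)
```

Under this toy definition, computing the average for a base model and for its detoxified counterpart on the same prompts would give a single number per model to compare, with a lower value indicating that generation leans more on previously generated tokens than on the prompt.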

Code & Models

Repositories

DanielSc4/RewardLM (PyTorch, official)

Taxonomy

Topics: Topic Modeling · Explainable Artificial Intelligence (XAI)

Methods: ALIGN