Parameter-Efficient Detoxification with Contrastive Decoding
Tong Niu, Caiming Xiong, Semih Yavuz, Yingbo Zhou

TL;DR
This paper introduces DETOXIGEN, a lightweight, inference-time detoxification method that uses contrastive decoding against a detoxifier trained on toxic data to steer language models away from toxic outputs.
Contribution
The paper presents a parameter-efficient, contrastive decoding approach that improves detoxification in language models by using a detoxifier trained on toxic data, requiring minimal additional resources.
Findings
Outperforms previous detoxification methods on the REALTOXICITYPROMPTS benchmark.
Maintains high generation quality while reducing toxic outputs.
Adds only a small number of extra weights, making it lightweight and practical.
Abstract
The field of natural language generation has witnessed significant advancements in recent years, including the development of controllable text generation techniques. However, controlling the attributes of the generated text remains a challenge, especially when aiming to avoid undesirable behavior such as toxicity. In this work, we introduce Detoxification Generator (DETOXIGEN), an inference-time algorithm that steers the generation away from unwanted styles. DETOXIGEN is an ensemble of a pre-trained language model (generator) and a detoxifier. The detoxifier is trained intentionally on the toxic data representative of the undesirable attribute, encouraging it to generate text in that style exclusively. During the actual generation, we use the trained detoxifier to produce undesirable tokens for the generator to contrast against at each decoding step. This approach directly informs the…
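The contrast step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract is cut off before it gives the exact formula, so the subtract-clip-renormalize rule, the weight `alpha`, and the function name `contrast_step` are all assumptions, and the toy distributions are made up.

```python
def contrast_step(p_generator, p_detoxifier, alpha=1.0):
    """Downweight tokens that the detoxifier favors at one decoding step.

    Assumed rule (not taken from the paper): subtract alpha * p_detoxifier
    from p_generator in probability space, clip negatives to zero, then
    renormalize so the result is a distribution again.
    """
    if len(p_generator) != len(p_detoxifier):
        raise ValueError("distributions must cover the same vocabulary")
    scores = [max(pg - alpha * pd, 0.0)
              for pg, pd in zip(p_generator, p_detoxifier)]
    total = sum(scores)
    if total == 0.0:
        # Every token was cancelled out; fall back to the generator unchanged.
        return list(p_generator)
    return [s / total for s in scores]


# Toy 4-token vocabulary; index 2 stands in for a toxic token.
p_gen = [0.4, 0.3, 0.2, 0.1]      # generator's next-token probabilities
p_tox = [0.05, 0.05, 0.8, 0.1]    # detoxifier's next-token probabilities
p_new = contrast_step(p_gen, p_tox, alpha=1.0)
```

In this toy case the token that the detoxifier strongly prefers ends up with zero probability, and the remaining mass is redistributed over the tokens the generator liked but the detoxifier did not.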
Peer Reviews
Decision · Submitted to ICLR 2024
Strengths:
- A lightweight framework that requires only toxic data for prompt tuning.
- Superior performance among six baselines.
- I am not sure how much I appreciate the technical contribution of this work. Both the generator and the detoxifier parts rely on existing methods, so it is hard to be convinced of the novelty. That said, the work does demonstrate that the framework is effective for detoxification, which is definitely valuable.
- The authors should show some qualitative examples to further back up Table 2.
- Only one benchmark dataset is used.
Originality: Although this paper is not particularly original in its methods (it uses established NLP techniques, namely contrastive decoding and soft-prompt tuning), it does apply them to non-toxic text generation, which is fairly original.
Quality: The experiments and idea are straightforward and simple. I view this as a strength, since anything more elaborate would only muddy the waters.
Clarity: The paper itself is quite clearly presented, and I did not find any parts confusing.
Significance: Since the meth
While I respect the authors' choice to stick to a small set of reasonably chosen design decisions, I would have liked to trade some of the comprehensiveness of the model-size experiments for a broader look at other hyperparameters, such as the method for creating the *detox* model (there are both more effective efficient fine-tuning methods, like LoRA, and cheaper, more straightforward non-fine-tuning methods, like plain-old prompting).
* The authors show that their technique enables toxicity reduction at many model sizes and for both the GPT-2 and LLaMA model families.
* The technique is relatively straightforward and efficient.
* The method seems like a pretty minor change from Liu et al. 2021's DEXPERTS. As the authors note, their technique operates in probability space, while the DEXPERTS technique operates on logits. Other than that, I can't find much difference. Their technique provides what look like small gains over DEXPERTS under their metrics. I would appreciate more analysis of why their formulation is preferable to DEXPERTS, and in which cases DEXPERTS might fail that their method would
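To make the logit-space versus probability-space distinction in this review concrete, here is a small hedged comparison. Neither function is taken from DEXPERTS or DETOXIGEN: `logit_space_contrast` is a simplified anti-expert-only combination on logits, `prob_space_contrast` assumes a subtract-clip-renormalize rule on probabilities, and the logits and `alpha` are invented for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logit_space_contrast(z_base, z_anti, alpha=1.0):
    # Simplified assumption: subtract the anti-expert's logits, then softmax.
    return softmax([zb - alpha * za for zb, za in zip(z_base, z_anti)])


def prob_space_contrast(z_base, z_anti, alpha=1.0):
    # Assumed rule: softmax each model first, subtract probabilities,
    # clip negatives to zero, and renormalize.
    p_base, p_anti = softmax(z_base), softmax(z_anti)
    scores = [max(pb - alpha * pa, 0.0) for pb, pa in zip(p_base, p_anti)]
    total = sum(scores)
    return [s / total for s in scores] if total > 0 else p_base


# Toy 4-token vocabulary; index 2 stands in for a toxic token.
z_base = [2.0, 1.0, 1.5, 0.0]   # hypothetical generator logits
z_anti = [0.0, 0.0, 3.0, 0.0]   # hypothetical toxic-model logits
p_logit = logit_space_contrast(z_base, z_anti)
p_prob = prob_space_contrast(z_base, z_anti)
```

Under these assumptions, the logit-space version only shrinks the probability of the penalized token, since softmax never outputs exactly zero, while the clipped probability-space version can remove it entirely. Whether that difference explains the reported gains is not something this sketch can show.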
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Software Engineering Research
