Parameter-Efficient Detoxification with Contrastive Decoding

Tong Niu; Caiming Xiong; Semih Yavuz; Yingbo Zhou

arXiv:2401.06947 · cs.CL · January 17, 2024 · 1 citation

TL;DR

This paper introduces DETOXIGEN, a lightweight inference-time detoxification method. It pairs a pre-trained language model with a detoxifier trained on toxic data and uses contrastive decoding to steer generation away from toxic outputs.

Contribution

The paper presents a parameter-efficient, contrastive decoding approach that improves detoxification in language models by using a detoxifier trained on toxic data, requiring minimal additional resources.

Findings

1. Outperforms previous detoxification methods on the REALTOXICITYPROMPTS benchmark.

2. Maintains high generation quality while reducing toxic outputs.

3. Adds only a small number of extra weights, making it lightweight and practical.

Abstract

The field of natural language generation has witnessed significant advancements in recent years, including the development of controllable text generation techniques. However, controlling the attributes of the generated text remains a challenge, especially when aiming to avoid undesirable behavior such as toxicity. In this work, we introduce Detoxification Generator (DETOXIGEN), an inference-time algorithm that steers the generation away from unwanted styles. DETOXIGEN is an ensemble of a pre-trained language model (generator) and a detoxifier. The detoxifier is trained intentionally on the toxic data representative of the undesirable attribute, encouraging it to generate text in that style exclusively. During the actual generation, we use the trained detoxifier to produce undesirable tokens for the generator to contrast against at each decoding step. This approach directly informs the…
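
The contrast performed at each decoding step can be illustrated with a minimal sketch. It assumes the contrast is a subtraction in probability space, as Reviewer 03 describes; the function names, the weight `alpha`, the clipping of negative values, and the renormalization are hypothetical choices made for illustration and are not taken from the paper.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrast_step(generator_logits, detoxifier_logits, alpha=1.0):
    """Return an adjusted next-token distribution for one decoding step.

    Assumption: tokens favoured by the detoxifier (trained on toxic data)
    are penalised by subtracting alpha times its probability from the
    generator's probability. Negative scores are clipped to zero and the
    result is renormalised. These details are illustrative only.
    """
    p_gen = softmax(generator_logits)
    p_tox = softmax(detoxifier_logits)
    scores = [max(g - alpha * t, 0.0) for g, t in zip(p_gen, p_tox)]
    total = sum(scores)
    if total == 0.0:
        # Every token was suppressed; fall back to the generator alone.
        return p_gen
    return [s / total for s in scores]


# Toy three-token vocabulary; index 1 plays the role of a toxic token.
generator_logits = [1.0, 1.0, 0.5]
detoxifier_logits = [-2.0, 3.0, -2.0]
adjusted = contrast_step(generator_logits, detoxifier_logits, alpha=1.0)
```

In this toy setup the detoxifier places almost all of its mass on the second token, so that token's adjusted probability drops to zero and the remaining mass is shared by the other two tokens. Setting `alpha` to 0 recovers the generator's own distribution.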

Peer Reviews

Decision · Submitted to ICLR 2024

Reviewer 01 · Rating 5 (marginally below the acceptance threshold) · Confidence 4

Strengths

- A lightweight framework that only requires toxic data for prompt tuning.
- Superior performance among six baselines.

Weaknesses

- I am not sure how much I appreciate the technical contribution of this work. It seems to me that both the generator and the detoxifier parts use existing methods, so it is hard to convince myself of the novelty. However, the work does show how the framework performs in the detoxification field, which is definitely valuable.
- The authors should show some qualitative examples to further back up Table 2.
- Only one benchmark dataset is used.

Reviewer 02 · Rating 8 (accept, good paper) · Confidence 4

Strengths

Originality: Although this paper is not particularly original in its methods (it uses established NLP methods: contrastive decoding and soft-prompt tuning), it does apply them to non-toxic text generation, which is fairly original.
Quality: The experiments and idea are straightforward and simple. I view this as a strength, since anything more elaborate would only muddy the waters.
Clarity: The paper itself is quite clearly presented, and I did not find any parts confusing.
Significance: Since the meth

Weaknesses

While I respect the authors' choice of sticking to a small set of reasonably chosen design decisions, I would have liked to trade some of the comprehensiveness of the model-size experiments for a broader look at some other hyperparameters, such as the method for creating the *detox* model (there are both more effective efficient fine-tuning methods like LoRA, and cheaper, more straightforward non-fine-tuning methods like plain-old prompting).

Reviewer 03 · Rating 5 (marginally below the acceptance threshold) · Confidence 4

Strengths

* The authors show that their technique enables toxicity reduction at many model sizes and for both the GPT-2 and LLaMA model families.
* The technique is relatively straightforward and efficient.

Weaknesses

* The method seems like a pretty minor change from DEXPERTS (Liu et al., 2021). As the authors note, their technique operates in probability space, while DEXPERTS operates on logits. Other than that, I can't find much difference. Their technique provides what look like small gains over DEXPERTS under their metrics. I would appreciate more analysis of why their formulation is preferable to DEXPERTS, and in which cases DEXPERTS might fail that their method would

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Software Engineering Research