Contrastive Perplexity for Controlled Generation: An Application in Detoxifying Large Language Models

Tassilo Klein; Moin Nabi

arXiv:2401.08491·cs.CL·June 2, 2025·1 cites



TL;DR

This paper introduces a contrastive perplexity framework for fine-tuning large language models to reduce toxic outputs, using adversarially generated hard negatives to improve safety without sacrificing task performance.

Contribution

It presents a novel contrastive perplexity objective leveraging hard negatives for implicit knowledge editing and controlled detoxification of LLMs.

Findings

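01 Significantly reduces toxic generation in detoxification experiments

02 Maintains strong performance on downstream tasks such as commonsense reasoning and reading comprehension

03 Training on adversarially paraphrased hard negatives keeps the contrastive optimization robust and stable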

Abstract

The generation of toxic content by large language models (LLMs) remains a critical challenge for the safe deployment of language technology. We propose a novel framework for implicit knowledge editing and controlled text generation by fine-tuning LLMs with a prototype-based contrastive perplexity objective. Central to our method is the construction of hard negatives: toxic outputs generated through adversarial paraphrasing so that they are close to their non-toxic counterparts in both meaning and model probability. By training on these challenging and realistic pairs, our approach ensures robust and stable contrastive optimization. Experimental results in the domain of detoxification demonstrate that our method significantly reduces toxic generation while maintaining strong performance on downstream tasks such as commonsense reasoning and reading comprehension. Our findings highlight the…
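To make the idea of a contrastive objective over perplexities concrete, here is a minimal sketch in plain Python. It is not the paper's exact loss. The abstract describes a prototype-based formulation whose details are not given here, so this sketch assumes a simpler InfoNCE-style softmax in which each sequence is scored by its negative perplexity. The names `perplexity`, `contrastive_perplexity_loss`, and `temperature` are hypothetical, and the per-token log-probabilities are assumed to come from the model being fine-tuned.

```python
import math


def perplexity(token_logprobs):
    """Perplexity of one sequence from its per-token natural-log probabilities."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def _logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a non-empty list."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def contrastive_perplexity_loss(pos_logprobs, neg_logprobs, temperature=1.0):
    """Illustrative InfoNCE-style loss (an assumption, not the paper's objective).

    pos_logprobs: list of token log-prob lists for non-toxic sequences.
    neg_logprobs: list of token log-prob lists for toxic hard negatives.
    Each sequence is scored as -perplexity / temperature, so the loss falls
    when non-toxic text has low perplexity and toxic text has high perplexity.
    """
    pos_scores = [-perplexity(lp) / temperature for lp in pos_logprobs]
    neg_scores = [-perplexity(lp) / temperature for lp in neg_logprobs]
    # -log( sum_pos exp(score) / sum_all exp(score) )
    return _logsumexp(pos_scores + neg_scores) - _logsumexp(pos_scores)


# Toy inputs: the model is confident on the non-toxic sequence
# and unconfident on the toxic paraphrase.
non_toxic = [[math.log(0.9)] * 5]
toxic = [[math.log(0.1)] * 5]
loss_desired = contrastive_perplexity_loss(non_toxic, toxic)
loss_undesired = contrastive_perplexity_loss(toxic, non_toxic)
```

In this toy setup `loss_desired` is smaller than `loss_undesired`, which is the direction a detoxification objective would push. Hard negatives matter because a toxic paraphrase whose perplexity is already close to that of its non-toxic counterpart gives a loss that is far from saturated, so the gradient signal stays informative.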


Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification

Methods: Contrastive Learning