Contrastive Perplexity for Controlled Generation: An Application in Detoxifying Large Language Models
Tassilo Klein, Moin Nabi

TL;DR
This paper introduces a contrastive perplexity framework for fine-tuning large language models to reduce toxic outputs, using adversarially generated hard negatives to improve safety without sacrificing task performance.
Contribution
It presents a novel contrastive perplexity objective leveraging hard negatives for implicit knowledge editing and controlled detoxification of LLMs.
Findings
Significantly reduces toxic content generation
Maintains strong performance on downstream tasks
Training on adversarial hard negatives yields robust and stable contrastive optimization
Abstract
The generation of toxic content by large language models (LLMs) remains a critical challenge for the safe deployment of language technology. We propose a novel framework for implicit knowledge editing and controlled text generation by fine-tuning LLMs with a prototype-based contrastive perplexity objective. Central to our method is the construction of hard negatives: toxic outputs generated through adversarial paraphrasing so that they are close to their non-toxic counterparts in both meaning and model probability. By training on these challenging and realistic pairs, our approach ensures robust and stable contrastive optimization. Experimental results in the domain of detoxification demonstrate that our method significantly reduces toxic generation while maintaining strong performance on downstream tasks such as commonsense reasoning and reading comprehension. Our findings highlight the…
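The abstract names a prototype-based contrastive perplexity objective but does not spell out its form here. The following is a minimal sketch of one plausible reading, not the authors' loss: perplexity is computed from per-token log-probabilities, the prototype is assumed to be the mean perplexity of the non-toxic (positive) samples, and an InfoNCE-style ratio rewards positives for sitting near the prototype while pushing toxic (negative) samples away. The function names, the choice of prototype, the absolute-difference distance, and the `temperature` parameter are all assumptions made for illustration.

```python
import math


def perplexity(token_logprobs):
    """Perplexity: exp of the mean negative log-likelihood per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def contrastive_perplexity_loss(pos_logprobs, neg_logprobs, temperature=1.0):
    """Hypothetical prototype-based contrastive loss over perplexities.

    pos_logprobs / neg_logprobs: lists of per-token log-probability lists,
    one inner list per non-toxic / toxic sample.
    """
    pos_ppl = [perplexity(lp) for lp in pos_logprobs]
    neg_ppl = [perplexity(lp) for lp in neg_logprobs]

    # Assumption: the prototype is the mean perplexity of the positives.
    prototype = sum(pos_ppl) / len(pos_ppl)

    def score(ppl):
        # Similarity decays with distance from the prototype.
        return math.exp(-abs(ppl - prototype) / temperature)

    pos_mass = sum(score(p) for p in pos_ppl)
    neg_mass = sum(score(p) for p in neg_ppl)

    # InfoNCE-style ratio: low when negatives are far from the prototype.
    return -math.log(pos_mass / (pos_mass + neg_mass))


# Two non-toxic samples whose tokens each have probability 0.5.
positives = [[math.log(0.5)] * 3, [math.log(0.5)] * 3]
# An "easy" negative the model already finds unlikely ...
easy_negative = [[math.log(0.01)] * 3]
# ... and a "hard" negative with the same model probability as the positives.
hard_negative = [[math.log(0.5)] * 3]

easy_loss = contrastive_perplexity_loss(positives, easy_negative)
hard_loss = contrastive_perplexity_loss(positives, hard_negative)
```

Under these assumptions the hard negative produces the larger loss, which illustrates why negatives matched in model probability give a stronger training signal than ones the model already scores as unlikely. In practice the log-probabilities would come from the LLM being fine-tuned, and the loss would be written in a differentiable framework rather than plain Python.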
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
Methods: Contrastive Learning
