Classifier-free guidance in LLMs Safety
Roman Smirnov

TL;DR
This paper introduces a method for unlearning in large language models that combines ORPO training with classifier-free guidance at inference. It unlearns effectively without a retain dataset and without degrading model performance.
Contribution
It presents a new unlearning approach combining ORPO reinforcement learning with classifier-free guidance during inference, improving unlearning efficiency without degrading the model.
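For orientation, the ORPO objective can be sketched per example as below. This is a minimal sketch of the published ORPO loss (a supervised term plus a weighted odds-ratio term), not the paper's training code. Treating the synthetic replacement text as the "chosen" response and the text to be forgotten as the "rejected" one is an assumption, and the names orpo_loss, log_odds and lam are hypothetical.

```python
import math

def log_odds(avg_logp):
    """Log-odds of a response from its length-averaged log-probability (< 0)."""
    p = math.exp(avg_logp)
    return avg_logp - math.log1p(-p)

def orpo_loss(avg_logp_chosen, avg_logp_rejected, lam=0.1):
    """Per-example ORPO loss: SFT term plus lam times the odds-ratio term.

    Assumption: 'chosen' is the synthetic replacement text and 'rejected'
    is the text to be unlearned.
    """
    sft = -avg_logp_chosen
    ratio = log_odds(avg_logp_chosen) - log_odds(avg_logp_rejected)
    # -log(sigmoid(ratio)) written as a numerically stable softplus(-ratio)
    or_term = max(-ratio, 0.0) + math.log1p(math.exp(-abs(ratio)))
    return sft + lam * or_term

if __name__ == "__main__":
    # The loss is lower when the replacement is more likely than the forgotten text.
    print(orpo_loss(math.log(0.8), math.log(0.2)))
    print(orpo_loss(math.log(0.2), math.log(0.8)))
```

The odds-ratio term penalizes the model whenever the rejected response remains competitive with the chosen one, so no separate reference model is needed.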
Findings
Significant improvement in unlearning performance
Effective unlearning without a retain dataset
Maintains model quality after unlearning
Abstract
The paper describes LLM unlearning without a retain dataset, using the ORPO reinforcement learning method with inference enhanced by modified classifier-free guidance (CFG). A significant improvement in unlearning, without degradation of the model, is achieved through direct training on synthetic replacement data in a CFG-aware training regime, with classifier-free guidance applied during inference. This article is an extended version of the NeurIPS 2024 LLM-PC submission, which was awarded second prize.
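As an illustration of guidance at inference time, the sketch below applies the standard classifier-free guidance rule to next-token log-probabilities. It is not the paper's modified variant, which the abstract does not specify. The function names, the toy logits, and the reading of the "conditional" branch as a prompt carrying an extra instruction are assumptions made for the example.

```python
import math

def log_softmax(logits):
    """Convert raw logits into log-probabilities."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def cfg_log_probs(cond_logits, uncond_logits, scale):
    """Standard CFG: uncond + scale * (cond - uncond), in log-probability space.

    scale = 1 recovers the conditional distribution, scale = 0 the
    unconditional one, and scale > 1 pushes further toward the condition.
    The result is unnormalized; renormalize before sampling.
    """
    cond = log_softmax(cond_logits)
    uncond = log_softmax(uncond_logits)
    guided = [u + scale * (c - u) for c, u in zip(cond, uncond)]
    return log_softmax(guided)

if __name__ == "__main__":
    # Toy vocabulary of three tokens (hypothetical values).
    cond = [2.0, 0.5, 0.1]    # logits with the guiding instruction in the prompt
    uncond = [1.0, 1.0, 0.1]  # logits without it
    print([round(math.exp(lp), 3) for lp in cfg_log_probs(cond, uncond, 2.0)])
```

Each decoding step needs two forward passes, one per branch, which is the main inference cost of this kind of guidance.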
Taxonomy
Topics: Web Application Security Vulnerabilities
