Classifier-free guidance in LLMs Safety

Roman Smirnov

arXiv:2412.06846·cs.LG·December 11, 2024

TL;DR

This paper introduces a method for unlearning in large language models that combines ORPO reinforcement learning with classifier-free guidance (CFG) at inference. Unlearning works without a retain dataset and does not degrade model performance.

Contribution

The paper presents an unlearning approach that combines ORPO reinforcement learning with classifier-free guidance at inference, improving unlearning without degrading the model.
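As a rough illustration of the ORPO objective mentioned above, the sketch below computes the standard ORPO loss (supervised negative log-likelihood plus a weighted odds-ratio term) for a single preference pair. It assumes that, in an unlearning setting, the "chosen" response is a synthetic replacement answer and the "rejected" response is the answer to be forgotten; that mapping, the function names, and the default weight `lam` are assumptions for illustration and are not taken from the paper.

```python
import math


def log_odds(avg_logp):
    """Log-odds of a response, log(p / (1 - p)), where p = exp(avg_logp).

    avg_logp is the length-normalised log-likelihood of the response and
    must be strictly negative so that p < 1.
    """
    p = math.exp(avg_logp)
    return avg_logp - math.log1p(-p)


def orpo_loss(chosen_avg_logp, rejected_avg_logp, lam=0.1):
    """Standard ORPO loss for one (chosen, rejected) pair.

    chosen_avg_logp:   avg log-likelihood of the preferred response
                       (assumed here: the synthetic replacement answer)
    rejected_avg_logp: avg log-likelihood of the dispreferred response
                       (assumed here: the answer to be unlearned)
    lam:               weight of the odds-ratio term (hypothetical default)
    """
    # Supervised term: negative log-likelihood of the chosen response.
    nll = -chosen_avg_logp
    # Odds-ratio term: -log sigmoid(log-odds(chosen) - log-odds(rejected)).
    x = log_odds(chosen_avg_logp) - log_odds(rejected_avg_logp)
    # Numerically stable softplus(-x) = log(1 + exp(-x)).
    or_term = max(-x, 0.0) + math.log1p(math.exp(-abs(x)))
    return nll + lam * or_term
```

The loss is lower when the model assigns higher likelihood to the chosen response than to the rejected one, which is the pressure that pushes the model away from the rejected output.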

Findings

01 Significant improvement in unlearning performance

02 Effective unlearning without dataset retention

03 Maintains model quality after unlearning

Abstract

The paper describes LLM unlearning without a retain dataset, using the ORPO reinforcement learning method with inference enhanced by modified classifier-free guidance (CFG). A significant improvement in unlearning, without degradation of the model, is achieved through direct training on synthetic replacement data in a CFG-aware training regime, with classifier-free guidance applied during inference. This article is an extended version of the NeurIPS 2024 LLM-PC submission, which was awarded second prize.
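For readers unfamiliar with classifier-free guidance on language-model outputs, the sketch below shows the commonly used formulation, in which next-token log-probabilities from a conditioned and an unconditioned forward pass are combined with a guidance scale. This is the generic form, not the modified variant the abstract refers to; the function names and the toy three-token vocabulary are assumptions for illustration.

```python
import math


def log_softmax(logits):
    """Convert raw logits to log-probabilities (numerically stable)."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def cfg_logits(cond_logits, uncond_logits, guidance_scale):
    """Generic classifier-free guidance over next-token scores.

    guided = uncond + guidance_scale * (cond - uncond), in log-prob space.
    A scale of 1 reproduces the conditioned distribution, 0 reproduces the
    unconditioned one, and values above 1 amplify the effect of the condition.
    """
    cond = log_softmax(cond_logits)
    uncond = log_softmax(uncond_logits)
    return [u + guidance_scale * (c - u) for c, u in zip(cond, uncond)]


def softmax(logits):
    """Normalise scores into a probability distribution."""
    return [math.exp(x) for x in log_softmax(logits)]


# Toy example: the condition favours token 0, the unconditioned pass is uniform.
cond_logits = [2.0, 0.0, 0.0]
uncond_logits = [0.0, 0.0, 0.0]
p_plain = softmax(cfg_logits(cond_logits, uncond_logits, 1.0))
p_guided = softmax(cfg_logits(cond_logits, uncond_logits, 2.0))
```

With a scale above 1, `p_guided` puts more mass on token 0 than `p_plain`, since the difference between the conditioned and unconditioned passes is extrapolated rather than merely applied.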

Code & Models

Repositories

rgsmirnov/cfg_safety_llm (PyTorch, official)

Taxonomy

Topics: Web Application Security Vulnerabilities