Guardrail Baselines for Unlearning in LLMs

Pratiksha Thaker; Yash Maurya; Shengyuan Hu; Zhiwei Steven Wu; Virginia Smith

arXiv:2403.03329·cs.CL·June 12, 2024·3 cites

TL;DR

This paper compares lightweight guardrail methods, such as prompting and filtering, against finetuning for unlearning concepts in large language models, and shows that the guardrails can achieve comparable results at lower computational cost.

Contribution

It demonstrates that simple guardrail approaches can match finetuning on unlearning tasks, and it argues for evaluation metrics that can better separate the power of guardrails from that of finetuning.

Findings

01 Guardrail methods achieve unlearning results comparable to finetuning.

02 Prompting and filtering are more computationally efficient alternatives to finetuning.

03 Existing metrics may not fully separate the effectiveness of guardrails from that of finetuning.

Abstract

Recent work has demonstrated that finetuning is a promising approach to 'unlearn' concepts from large language models. However, finetuning can be expensive, as it requires both generating a set of examples and running iterations of finetuning to update the model. In this work, we show that simple guardrail-based approaches such as prompting and filtering can achieve unlearning results comparable to finetuning. We recommend that researchers investigate these lightweight baselines when evaluating the performance of more computationally intensive finetuning methods. While we do not claim that methods such as prompting or filtering are universal solutions to the problem of unlearning, our work suggests the need for evaluation metrics that can better separate the power of guardrails vs. finetuning, and highlights scenarios where guardrails expose possible unintended behavior in existing…
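
To make the two guardrail styles named in the abstract concrete, here is a minimal Python sketch of a prompt prefix combined with an output filter. It is an illustration under stated assumptions, not the authors' implementation: the function names, the refusal string, the keyword-matching rule, and the `toy_model` stand-in are all hypothetical, and a real setup would call an actual LLM and would likely use a stronger filter than keyword matching.

```python
import re
from typing import Callable, List

# Hypothetical refusal message returned when the filter fires.
REFUSAL = "I'm sorry, I don't have information about that."


def make_prompt_guardrail(forget_topics: List[str]) -> str:
    """Build an instruction prefix asking the model to act as if the
    listed topics are unknown (the 'prompting' style of guardrail)."""
    topics = ", ".join(forget_topics)
    return (
        "Behave as if you have no knowledge of the following topics: "
        f"{topics}. If asked about them, say you do not know.\n\n"
    )


def make_output_filter(forget_terms: List[str]) -> Callable[[str], str]:
    """Return a function that replaces any response mentioning a forget
    term with a refusal (the 'filtering' style of guardrail)."""
    if not forget_terms:
        # Nothing to forget: pass responses through unchanged.
        return lambda response: response
    pattern = re.compile(
        "|".join(r"\b" + re.escape(term) + r"\b" for term in forget_terms),
        flags=re.IGNORECASE,
    )

    def filter_response(response: str) -> str:
        return REFUSAL if pattern.search(response) else response

    return filter_response


def guarded_generate(
    generate: Callable[[str], str], prompt: str, forget_terms: List[str]
) -> str:
    """Wrap any text-in, text-out model with both guardrails.
    No model weights are updated at any point."""
    prefix = make_prompt_guardrail(forget_terms)
    raw = generate(prefix + prompt)
    return make_output_filter(forget_terms)(raw)


def toy_model(text: str) -> str:
    """Stand-in for an LLM that ignores the prefix, so the filter has to
    catch the leak. Purely illustrative."""
    if "wizard" in text.lower():
        return "Harry Potter is a wizard."
    return "Paris is the capital of France."


if __name__ == "__main__":
    forget = ["Harry Potter"]  # illustrative forget target
    print(guarded_generate(toy_model, "Who is the famous boy wizard?", forget))
    print(guarded_generate(toy_model, "What is the capital of France?", forget))
```

Because the wrapper only touches the model's inputs and outputs, it needs no generated training examples and no finetuning iterations, which are the two costs of finetuning that the abstract points to.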
