Guardrail Baselines for Unlearning in LLMs
Pratiksha Thaker, Yash Maurya, Shengyuan Hu, Zhiwei Steven Wu, Virginia Smith

TL;DR
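This paper compares lightweight guardrail methods, such as prompting and filtering, against finetuning for unlearning concepts in large language models. It shows that the guardrails can reach comparable unlearning results without the cost of generating training examples and running finetuning iterations.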
Contribution
It demonstrates that simple guardrail approaches can match finetuning in unlearning tasks and emphasizes the need for better evaluation metrics to distinguish their effectiveness.
Findings
Guardrail methods achieve comparable unlearning results to finetuning.
Prompting and filtering are cheaper alternatives: they need neither a generated set of examples nor finetuning iterations.
Existing evaluation metrics do not clearly separate what guardrails achieve from what finetuning achieves.
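To make the idea of a prompting-plus-filtering guardrail concrete, here is a minimal sketch. It is not the authors' implementation: the helper names (`build_guarded_prompt`, `mentions_forget_term`, `guarded_generate`), the refusal string, the keyword-based filter, and the example forget term are all assumptions chosen for illustration, and `model` stands in for any text-in, text-out callable.

```python
import re
from typing import Callable, Iterable, List

# Hypothetical refusal message returned whenever a guardrail triggers.
REFUSAL = "I'm sorry, I don't have information about that."


def build_guarded_prompt(user_prompt: str, forget_terms: List[str]) -> str:
    """Prompting guardrail: prepend an instruction telling the model to
    behave as if it knows nothing about the listed topics."""
    prefix = (
        "You have no knowledge of the following topics: "
        + ", ".join(forget_terms)
        + ". If asked about them, say you do not know.\n\n"
    )
    return prefix + user_prompt


def mentions_forget_term(text: str, forget_terms: List[str]) -> bool:
    """Filtering guardrail: case-insensitive whole-phrase keyword match.
    A real filter could instead be a trained or prompted classifier."""
    return any(
        re.search(r"\b" + re.escape(term) + r"\b", text, re.IGNORECASE)
        for term in forget_terms
    )


def guarded_generate(
    model: Callable[[str], str],
    user_prompt: str,
    forget_terms: Iterable[str],
) -> str:
    """Wrap an unmodified model with an input filter, a prompt prefix,
    and an output filter. No model weights are updated."""
    terms = list(forget_terms)
    # Input filter: refuse queries that name a forgotten topic directly.
    if mentions_forget_term(user_prompt, terms):
        return REFUSAL
    response = model(build_guarded_prompt(user_prompt, terms))
    # Output filter: suppress responses that leak a forgotten topic.
    if mentions_forget_term(response, terms):
        return REFUSAL
    return response
```

Because the wrapper only touches the model's input and output, it costs a few string operations per query rather than a finetuning run, which is the efficiency argument summarized above. A keyword filter like this one is easy to evade with paraphrases, so it should be read as a baseline sketch rather than a robust defense.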
Abstract
Recent work has demonstrated that finetuning is a promising approach to 'unlearn' concepts from large language models. However, finetuning can be expensive, as it requires both generating a set of examples and running iterations of finetuning to update the model. In this work, we show that simple guardrail-based approaches such as prompting and filtering can achieve unlearning results comparable to finetuning. We recommend that researchers investigate these lightweight baselines when evaluating the performance of more computationally intensive finetuning methods. While we do not claim that methods such as prompting or filtering are universal solutions to the problem of unlearning, our work suggests the need for evaluation metrics that can better separate the power of guardrails vs. finetuning, and highlights scenarios where guardrails expose possible unintended behavior in existing…
Taxonomy
Topics: Reservoir Engineering and Simulation Methods
Methods: Sparse Evolutionary Training
