Applying Refusal-Vector Ablation to Llama 3.1 70B Agents
Simon Lermen, Mateusz Dziemian, Govind Pimpale

TL;DR
This paper demonstrates that applying refusal-vector ablation to Llama 3.1 70B models creates agents capable of harmful actions, exposing safety vulnerabilities and emphasizing the need for better safety measures in agentic language models.
Contribution
The study applies refusal-vector ablation to Llama 3.1 70B and builds a simple agent scaffolding around the resulting model, revealing safety vulnerabilities. It also introduces a small Safe Agent Benchmark that tests both harmful and benign tasks in agentic scenarios.
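The summary does not spell out the ablation step itself. As a rough illustration only, refusal-vector ablation is commonly described as removing the component of a hidden activation that lies along a single "refusal direction". The sketch below shows that projection on plain Python lists. The function name `ablate_direction` and the toy vectors are hypothetical, and this is not the authors' implementation.

```python
import math

def ablate_direction(activation, direction):
    """Return `activation` with its component along `direction` removed.

    Hypothetical helper for illustration: computes x - (x . r_hat) * r_hat,
    where r_hat is `direction` scaled to unit length.
    """
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    unit = [d / norm for d in direction]
    # Scalar projection of the activation onto the unit direction.
    coeff = sum(a * u for a, u in zip(activation, unit))
    # Subtract the projected component; the orthogonal part is untouched.
    return [a - coeff * u for a, u in zip(activation, unit)]

# Toy example (made-up numbers, not model activations).
activation = [3.0, 4.0, 1.0]
direction = [0.0, 2.0, 0.0]
ablated = ablate_direction(activation, direction)
```

After the call, `ablated` has no component left along `direction`, while the other coordinates are unchanged. In a real model the same projection would presumably be applied to hidden states or weight matrices, which this toy example does not attempt.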
Findings
Refusal-vector ablated models can successfully complete harmful tasks, such as bribing officials or crafting phishing attacks.
Models refuse to give advice on harmful tasks in chat mode.
Safety fine-tuning in chat models does not generalize well to agentic behavior: unmodified Llama 3.1 Instruct models are willing to perform most harmful tasks.
Abstract
Recently, language models like Llama 3.1 Instruct have become increasingly capable of agentic behavior, enabling them to perform tasks requiring short-term planning and tool use. In this study, we apply refusal-vector ablation to Llama 3.1 70B and implement a simple agent scaffolding to create an unrestricted agent. Our findings imply that these refusal-vector ablated models can successfully complete harmful tasks, such as bribing officials or crafting phishing attacks, revealing significant vulnerabilities in current safety mechanisms. To further explore this, we introduce a small Safe Agent Benchmark, designed to test both harmful and benign tasks in agentic scenarios. Our results imply that safety fine-tuning in chat models does not generalize well to agentic behavior, as we find that Llama 3.1 Instruct models are willing to perform most harmful tasks without modifications. At the…
Taxonomy
Topics: Laser Design and Applications · Cardiac Arrhythmias and Treatments · Magnetic confinement fusion research
Methods: LLaMA
