Can Safety Fine-Tuning Be More Principled? Lessons Learned from Cybersecurity

David Williams-King; Linh Le; Adam Oberman; Yoshua Bengio

arXiv:2501.11183·cs.CR·January 22, 2025

Open Access

TL;DR

This paper critiques current safety fine-tuning methods for large language models, highlighting their reactive nature and proposing the adoption of more principled, security-oriented design approaches inspired by cybersecurity lessons.

Contribution

The paper draws parallels between LLM safety fine-tuning and cybersecurity arms races, and advocates designing proactive, principled safety mechanisms from the outset.

Findings

01 Current defenses are easily bypassed by new attacks

02 Reactive fine-tuning leads to an arms race with attackers

03 Principled, security-oriented design can improve safety

Abstract

As LLMs develop increasingly advanced capabilities, there is an increased need to minimize the harm that could be caused to society by certain model outputs; hence, most LLMs have safety guardrails added, for example via fine-tuning. In this paper, we argue the position that current safety fine-tuning is very similar to a traditional cat-and-mouse game (or arms race) between attackers and defenders in cybersecurity. Model jailbreaks and attacks are patched with bandaids to target the specific attack mechanism, but many similar attack vectors might remain. When defenders are not proactively coming up with principled mechanisms, it becomes very easy for attackers to sidestep any new defenses. We show how current defenses are insufficient to prevent new adversarial jailbreak attacks, reward hacking, and loss of control problems. In order to learn from past mistakes in cybersecurity, we…

Taxonomy

Topics: Safety Systems Engineering in Autonomy · Information and Cyber Security · Risk and Safety Analysis