Can Safety Fine-Tuning Be More Principled? Lessons Learned from Cybersecurity
David Williams-King, Linh Le, Adam Oberman, Yoshua Bengio

TL;DR
This paper critiques current safety fine-tuning methods for large language models, highlighting their reactive nature and proposing the adoption of more principled, security-oriented design approaches inspired by cybersecurity lessons.
Contribution
The paper draws parallels between LLM safety fine-tuning and cybersecurity arms races, and advocates building proactive, principled safety mechanisms in from the outset.
Findings
Current defenses are easily bypassed by new attacks
Reactive fine-tuning leads to an arms race with attackers
Principled, security-oriented design can improve safety
Abstract
As LLMs develop increasingly advanced capabilities, there is an increased need to minimize the harm that could be caused to society by certain model outputs; hence, most LLMs have safety guardrails added, for example via fine-tuning. In this paper, we argue the position that current safety fine-tuning is very similar to a traditional cat-and-mouse game (or arms race) between attackers and defenders in cybersecurity. Model jailbreaks and attacks are patched with bandaids to target the specific attack mechanism, but many similar attack vectors might remain. When defenders are not proactively coming up with principled mechanisms, it becomes very easy for attackers to sidestep any new defenses. We show how current defenses are insufficient to prevent new adversarial jailbreak attacks, reward hacking, and loss of control problems. In order to learn from past mistakes in cybersecurity, we…
Taxonomy
Topics: Safety Systems Engineering in Autonomy · Information and Cyber Security · Risk and Safety Analysis
