No, of Course I Can! Deeper Fine-Tuning Attacks That Bypass Token-Level Safety Mechanisms

Joshua Kazdan; Abhay Puri; Rylan Schaeffer; Lisa Yu; Chris Cundy; Jason Stanley; Sanmi Koyejo; Krishnamurthy Dvijotham

arXiv:2502.19537 · cs.CR · July 15, 2025

TL;DR

This paper introduces a new fine-tuning attack that trains language models to refuse harmful requests before complying, effectively bypassing safety filters and exposing vulnerabilities in both open-source and commercial models.

Contribution

The paper presents a novel "refuse-then-comply" fine-tuning attack. Existing attacks alter only the first several tokens of the model response; this attack reaches deeper into the response and bypasses safety mechanisms in state-of-the-art language models.

Findings

01. Achieved 57% success rate against GPT-4o

02. Achieved 72% success rate against Claude Haiku

03. Received a $2000 bug bounty from OpenAI

Abstract

Leading language model (LM) providers like OpenAI and Anthropic allow customers to fine-tune frontier LMs for specific use cases. To prevent abuse, these providers apply filters to block fine-tuning on overtly harmful data. In this setting, we make three contributions: First, while past work has shown that safety alignment is "shallow", we correspondingly demonstrate that existing fine-tuning attacks are shallow -- attacks target only the first several tokens of the model response, and consequently can be blocked by generating the first several response tokens with an aligned model. Second, we conceptually illustrate how to make attacks deeper by introducing a new fine-tuning attack that trains models to first refuse harmful requests before answering them; this "refuse-then-comply" strategy bypasses shallow defenses and produces harmful responses that evade output filters. Third, we…
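The shallow defense mentioned in the abstract (generating the first several response tokens with an aligned model, then handing off to the fine-tuned model) can be illustrated with a small toy. This is only a sketch under stated assumptions, not the authors' implementation: the "models" are hypothetical scripted stand-ins rather than real LMs, and the names `toy_model`, `generate_with_aligned_prefix`, `REFUSAL`, and `COMPLY` are invented here for illustration. The toy assumes a shallowly attacked model whose changed behavior is keyed to the opening tokens, so it continues a refusal once one has been started.

```python
from typing import Callable, List, Optional

# Hypothetical interface: a "model" maps (prompt, response tokens so far)
# to the next token, or None when the response is finished.
NextToken = Callable[[str, List[str]], Optional[str]]

# Placeholder responses; no real content is modeled here.
REFUSAL = ["I", "cannot", "help", "with", "that."]
COMPLY = ["Sure,", "here", "is", "[placeholder]"]


def toy_model(responses: List[List[str]]) -> NextToken:
    """Toy stand-in: continue the first scripted response matching the prefix."""
    def step(prompt: str, so_far: List[str]) -> Optional[str]:
        n = len(so_far)
        for r in responses:
            if r[:n] == so_far:
                return r[n] if n < len(r) else None
        return None  # no scripted continuation for this prefix
    return step


# Aligned model: always refuses.
aligned = toy_model([REFUSAL])
# Assumed "shallow" fine-tuned model: complies from an empty prefix,
# but continues a refusal if the response already began as one.
shallow_tuned = toy_model([COMPLY, REFUSAL])


def generate_with_aligned_prefix(
    aligned_model: NextToken,
    tuned_model: NextToken,
    prompt: str,
    k: int,
    max_tokens: int = 32,
) -> List[str]:
    """Take the first k tokens from aligned_model, the rest from tuned_model."""
    out: List[str] = []
    while len(out) < max_tokens:
        model = aligned_model if len(out) < k else tuned_model
        tok = model(prompt, out)
        if tok is None:
            break
        out.append(tok)
    return out


# k=0: no defense, the shallow attack takes effect from the first token.
print(generate_with_aligned_prefix(aligned, shallow_tuned, "request", k=0))
# k=3: the aligned prefix steers the toy fine-tuned model into the refusal.
print(generate_with_aligned_prefix(aligned, shallow_tuned, "request", k=3))
```

In this toy, the aligned prefix is enough to block the shallow attack. The abstract's second contribution is that a refuse-then-comply attack is designed so that an initial refusal no longer determines the rest of the response, which is why a prefix-based defense of this kind does not stop it.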
