Self-Mined Hardness for Safety Fine-Tuning

Prakhar Gupta; Garv Shah; Donghua Zhang

arXiv:2605.03226 · cs.LG · May 6, 2026

TL;DR

This paper introduces a self-mined hardness approach to safety fine-tuning of language models. It sharply reduces jailbreak attack success rates, but it also raises refusal rates on jailbreak-shaped benign prompts unless the hard prompts are mixed with adversarially framed benign ones.

Contribution

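It proposes scoring each candidate prompt's difficulty by how often the target model's own rollouts are judged harmful, then fine-tuning on the hardest prompts paired with the model's own non-jailbroken rollouts. This removes the need for a curated adversarial dataset.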
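
One way to make this score concrete, as a sketch rather than the paper's stated definition: assume the target model is sampled K times per prompt and a judge returns a binary harmful/not-harmful verdict for each rollout. The symbols h, K, y_k, J, and \pi_\theta are notation introduced here for illustration, not taken from the paper.

```latex
h(x) = \frac{1}{K} \sum_{k=1}^{K} \mathbf{1}\!\left[ J(x, y_k) = \text{harmful} \right],
\qquad y_k \sim \pi_\theta(\cdot \mid x)
```

Under this reading, h(x) = 0 means no sampled rollout was judged harmful and h(x) = 1 means every rollout was; the "hardest" prompts are those with the largest h(x).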

Findings

01

Reduces WildJailbreak attack success rate from 11.5% (Llama-3-8B-Instruct) and 20.1% (Llama-3.2-3B-Instruct) to 1-3%.

02

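Raises refusal on jailbreak-shaped benign prompts from 14-22% to 74-94%, an over-refusal side effect. Interleaving the hard prompts 1:1 with adversarially framed benign prompts brings this back down to 30-51% on 8B and 52-72% on 3B, at a cost of 2-6 percentage points of attack success rate.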

03

Within the mixed regime, training on the hardest half of the eligible pool rather than a random half decreases attack success rate by 35-50%.

Abstract

Safety fine-tuning of language models typically requires a curated adversarial dataset. We take a different approach: score each candidate prompt's difficulty by how often the target model's own rollouts are judged harmful, then fine-tune on the hardest prompts paired with the model's own non-jailbroken rollouts. On Llama-3-8B-Instruct and Llama-3.2-3B-Instruct, this approach cuts the WildJailbreak attack success rate from 11.5% and 20.1% down to 1-3%, but pushes refusal on jailbreak-shaped benign prompts from 14-22% to 74-94%. Interleaving the same hard prompts 1:1 with adversarially-framed benign prompts (prompts that look like jailbreaks but have benign intent) cuts that refusal back down to 30-51% on 8B and 52-72% on 3B, at a cost of 2-6 percentage points of attack success rate. Within the mixed regime, training on the hardest half of the eligible pool rather than a random half cuts…
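
The pipeline described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Rollout` type, the `hardness` and `build_training_set` functions, and the `keep_fraction` parameter are hypothetical names. The sketch also assumes that rollouts have already been sampled and labeled by some judge, and that a prompt is "eligible" when it has at least one non-harmful rollout to serve as a training target (the abstract does not define eligibility).

```python
import random
from dataclasses import dataclass


@dataclass
class Rollout:
    text: str
    harmful: bool  # judge verdict for this rollout (hypothetical field)


def hardness(rollouts):
    """Fraction of a prompt's rollouts that were judged harmful."""
    return sum(r.harmful for r in rollouts) / len(rollouts)


def build_training_set(pool, benign, keep_fraction=0.5, seed=0):
    """Select the hardest prompts and interleave them 1:1 with benign pairs.

    pool:   dict mapping prompt -> list of Rollout sampled from the target model
    benign: list of (prompt, response) pairs for jailbreak-shaped benign prompts
    """
    rng = random.Random(seed)
    # Assumption: eligible prompts have at least one non-harmful rollout,
    # so the model's own safe response can be used as the target.
    eligible = {p: rs for p, rs in pool.items() if any(not r.harmful for r in rs)}
    # Rank by self-mined hardness, hardest first, and keep the top fraction.
    ranked = sorted(eligible, key=lambda p: hardness(eligible[p]), reverse=True)
    hardest = ranked[: max(1, int(len(ranked) * keep_fraction))]

    examples = []
    # zip stops at the shorter list, which keeps the mix at exactly 1:1.
    for prompt, (b_prompt, b_response) in zip(hardest, benign):
        safe = [r.text for r in pool[prompt] if not r.harmful]
        examples.append((prompt, rng.choice(safe)))  # model's own safe rollout
        examples.append((b_prompt, b_response))      # benign counterpart
    return examples
```

Setting `keep_fraction=0.5` corresponds to the "hardest half" condition in the abstract; replacing the sort with a random shuffle of the eligible prompts would give the random-half baseline it is compared against.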
