Self-Mined Hardness for Safety Fine-Tuning
Prakhar Gupta, Garv Shah, Donghua Zhang

TL;DR
This paper introduces a self-mined hardness approach for safety fine-tuning of language models. It sharply reduces jailbreak success rates, but also raises refusal on jailbreak-shaped benign prompts unless adversarially-framed benign prompts are mixed into training.
Contribution
It proposes a method that scores each prompt's difficulty by how often the target model's own rollouts are judged harmful, then fine-tunes on the hardest prompts paired with the model's own non-jailbroken rollouts, so no curated adversarial dataset is needed.
Findings
Reduces WildJailbreak attack success rate from 11.5% (Llama-3-8B-Instruct) and 20.1% (Llama-3.2-3B-Instruct) to 1-3%.
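As a side effect, raises refusal on jailbreak-shaped benign prompts from 14-22% to 74-94%; interleaving the hard prompts 1:1 with adversarially-framed benign prompts brings this back down to 30-51% on 8B and 52-72% on 3B, at a cost of 2-6 percentage points of attack success rate.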
Within the mixed regime, training on the hardest half of the eligible pool rather than a random half decreases attack success rate by 35-50%.
Abstract
Safety fine-tuning of language models typically requires a curated adversarial dataset. We take a different approach: score each candidate prompt's difficulty by how often the target model's own rollouts are judged harmful, then fine-tune on the hardest prompts paired with the model's own non-jailbroken rollouts. On Llama-3-8B-Instruct and Llama-3.2-3B-Instruct, this approach cuts the WildJailbreak attack success rate from 11.5% and 20.1% down to 1-3%, but pushes refusal on jailbreak-shaped benign prompts from 14-22% to 74-94%. Interleaving the same hard prompts 1:1 with adversarially-framed benign prompts (prompts that look like jailbreaks but have benign intent) cuts that refusal back down to 30-51% on 8B and 52-72% on 3B, at a cost of 2-6 percentage points of attack success rate. Within the mixed regime, training on the hardest half of the eligible pool rather than a random half cuts…
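The pipeline the abstract describes (score each prompt by the fraction of its rollouts judged harmful, keep the hardest prompts, pair them with the model's own non-flagged rollouts, and interleave 1:1 with benign examples) can be sketched roughly as follows. This is an illustrative sketch, not the authors' code: the function names, the `generate` and `is_harmful` callables, the number of rollouts, the eligibility rule (at least one non-flagged rollout), and the choice of which safe rollout to train on are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class ScoredPrompt:
    prompt: str
    hardness: float           # fraction of rollouts judged harmful
    safe_rollouts: List[str]  # rollouts the judge did not flag


def score_prompt(
    prompt: str,
    generate: Callable[[str], str],          # hypothetical: samples one rollout from the target model
    is_harmful: Callable[[str, str], bool],  # hypothetical: judge verdict for (prompt, response)
    n_rollouts: int = 8,                     # assumed value; not taken from the paper
) -> ScoredPrompt:
    """Estimate hardness as the share of the model's own rollouts judged harmful."""
    rollouts = [generate(prompt) for _ in range(n_rollouts)]
    flags = [is_harmful(prompt, r) for r in rollouts]
    safe = [r for r, flagged in zip(rollouts, flags) if not flagged]
    return ScoredPrompt(prompt, sum(flags) / n_rollouts, safe)


def select_hardest(scored: List[ScoredPrompt], fraction: float = 0.5) -> List[ScoredPrompt]:
    """Keep the hardest `fraction` of prompts that have a safe rollout to train on.

    Assumption: a prompt is "eligible" only if at least one rollout was not
    flagged, since otherwise there is no self-generated safe target.
    """
    eligible = [s for s in scored if s.safe_rollouts]
    eligible.sort(key=lambda s: s.hardness, reverse=True)
    return eligible[: int(len(eligible) * fraction)]


def build_mixed_dataset(
    hard: List[ScoredPrompt],
    benign_pairs: List[Tuple[str, str]],  # (adversarially-framed benign prompt, helpful answer)
) -> List[Tuple[str, str]]:
    """Interleave hard prompts 1:1 with benign examples.

    Each hard prompt is paired with its first safe rollout (an arbitrary
    choice for this sketch). zip() stops at the shorter list, which keeps
    the 1:1 ratio.
    """
    harmful_pairs = [(s.prompt, s.safe_rollouts[0]) for s in hard]
    mixed: List[Tuple[str, str]] = []
    for harmful_pair, benign_pair in zip(harmful_pairs, benign_pairs):
        mixed.extend([harmful_pair, benign_pair])
    return mixed
```

In practice `generate` would wrap sampling from the target model and `is_harmful` would wrap whatever judge is used; the resulting (prompt, response) pairs would then feed a standard supervised fine-tuning loop.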
