Dynamic Adversarial Fine-Tuning Reorganizes Refusal Geometry
Wenhao Lan, Shan Li, Junbin Yang, Haihua Shen, Yijun Yang

TL;DR
This paper investigates how dynamic adversarial fine-tuning reorganizes the internal refusal geometry of a safety-aligned 7B language model, and finds that robustness against harmful requests trades off against benign utility across training checkpoints.
Contribution
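It studies R2D2 (Robust Refusal Dynamic Defense), a HarmBench-style adversarial fine-tuning procedure, against a supervised fine-tuning (SFT) baseline on one 7B backbone, and provides geometric and causal insights into how R2D2 reorganizes refusal geometry and how its robustness trades off against benign utility.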
Findings
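R2D2 drives fixed-source HarmBench attack success to zero at early checkpoints, but those checkpoints also show maximal XSTest refusal and complete failure on a benign-utility audit; later checkpoints partially recover benign utility while partially reopening attack success.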
R2D2 preserves a late-layer admissible carrier and relocates it earlier, unlike SFT.
Effective rank remains stable, and causal interventions reveal low-dimensional utility-coupled carriers.
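For readers unfamiliar with the effective-rank measure mentioned in the findings, here is a minimal sketch of one common definition: the exponential of the Shannon entropy of the normalized singular values of a centered activation matrix. This summary does not state which definition the paper uses, so the entropy-based form, the function name `effective_rank`, and the toy inputs are all assumptions made for illustration.

```python
import numpy as np


def effective_rank(activations: np.ndarray, eps: float = 1e-12) -> float:
    """Entropy-based effective rank of a (samples x features) matrix.

    Assumption: this is one common definition, not necessarily the one
    used in the paper.
    """
    # Center each feature so the measure reflects variation, not the mean.
    centered = activations - activations.mean(axis=0, keepdims=True)
    # Singular values only; the singular vectors are not needed.
    s = np.linalg.svd(centered, compute_uv=False)
    # Normalize the singular values into a probability distribution.
    p = s / (s.sum() + eps)
    # Drop numerically zero entries so log() stays finite.
    p = p[p > eps]
    entropy = -(p * np.log(p)).sum()
    return float(np.exp(entropy))


# Toy inputs (hypothetical, not data from the paper).
# All rows lie along a single direction, so variation is one-dimensional.
rank_one = np.outer(np.arange(1.0, 9.0), np.array([1.0, 2.0, -1.0, 0.5]))
# Four equally weighted orthogonal directions, already mean-zero.
isotropic = np.vstack([np.eye(4), -np.eye(4)])

er_rank_one = effective_rank(rank_one)
er_isotropic = effective_rank(isotropic)
```

With these toy inputs the first value is close to 1 and the second close to 4. A stable effective rank across checkpoints would mean this number changes little even if the directions themselves move.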
Abstract
Safety-aligned language models must refuse harmful requests without collapsing into broad over-refusal, yet it remains unclear how dynamic adversarial fine-tuning changes the internal carriers of refusal. We study one 7B backbone under supervised fine-tuning (SFT) and under Robust Refusal Dynamic Defense (R2D2), a HarmBench-style adversarial fine-tuning procedure that repeatedly refreshes harmful training cases with current jailbreak attacks. Our protocol aligns fixed-source HarmBench, StrongREJECT, and XSTest with a five-anchor refusal-geometry suite, causal interventions, and a sparse adaptive stress test. R2D2 drives fixed-source HarmBench attack success to zero at early checkpoints, but that regime coincides with maximal XSTest refusal and complete failure on a benign-utility audit. Later checkpoints partially recover benign utility while partially reopening attack success. Sparse…
