Limits of Convergence-Rate Control for Open-Weight Safety
Domenic Rosati, Xijie Zeng, Hong Huang, Sebastian Dionicio, Subhabrata Majumdar, Frank Rudzicz, and Hassan Sajjad

TL;DR
This paper explores the theoretical limits of controlling the convergence rate of open-weight foundation models to prevent harmful fine-tuning, introducing spectral reparameterization and a novel algorithm, SpecDef.
Contribution
It develops a spectral reparameterization approach and the SpecDef algorithm to slow convergence, and establishes fundamental limits of convergence control in adversarial settings.
Findings
SpecDef can slow optimization in non-adversarial settings
Fundamental limits exist for convergence control against knowledgeable attackers
Controlling convergence rate alone is insufficient for robust safety in adversarial scenarios
Abstract
Open-weight foundation models can be fine-tuned for harmful purposes after release, yet no existing training-resistance method provides theoretical guarantees. Treating these interventions as convergence-rate control problems allows us to connect optimization speed to the spectral structure of model weights. We leverage this insight to develop a novel understanding of convergence-rate control through spectral reparameterization and derive an algorithm, SpecDef, that can both provably and empirically slow first- and second-order optimization in non-adversarial settings. In adversarial settings, we establish a fundamental limit on a broad class of convergence-rate control methods, including our own: an attacker with sufficient knowledge can restore fast convergence at a linear increase in model size. To overcome this limitation, future work will need to investigate methods that…
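The link between optimization speed and spectral structure can be illustrated with a textbook toy problem. The sketch below is not SpecDef and is not taken from the paper. It assumes a diagonal quadratic objective whose curvature eigenvalues are chosen by hand, and the function name `gd_steps_to_tol` is hypothetical. It only shows that plain gradient descent needs far more steps when the spectrum is widely spread.

```python
import numpy as np

def gd_steps_to_tol(eigenvalues, tol=1e-6, max_steps=1_000_000):
    """Count gradient-descent steps on f(x) = 0.5 * sum(lam_i * x_i**2).

    Illustrative helper (hypothetical, not from the paper): the eigenvalues
    play the role of the curvature spectrum of the objective.
    """
    lam = np.asarray(eigenvalues, dtype=float)
    lr = 1.0 / lam.max()          # standard step size 1/L for a quadratic
    x = np.ones_like(lam)         # start at the all-ones point
    for step in range(1, max_steps + 1):
        x = x - lr * lam * x      # the gradient of f at x is lam * x
        if np.linalg.norm(x) < tol:
            return step
    return max_steps

# Same largest eigenvalue, different spread (condition number 1 vs. 100).
well = gd_steps_to_tol([1.0, 1.0])
ill = gd_steps_to_tol([1.0, 0.01])
print(f"well-conditioned: {well} steps, ill-conditioned: {ill} steps")
```

In this toy setting the slow direction shrinks by a factor of (1 - 1/kappa) per step, where kappa is the condition number, so widening the spectrum directly slows convergence. How the paper turns that observation into a reparameterization of model weights is not reproduced here.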
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Stochastic Gradient Optimization Techniques · Smart Grid Security and Resilience
