Locking Pretrained Weights via Deep Low-Rank Residual Distillation
Keitaro Sakamoto, Pierre Ablin, Federico Danieli, Marco Cuturi

TL;DR
This paper introduces DLR-Lock, a novel method to secure pretrained language models by replacing their components with deep low-rank residual networks, making unauthorized adaptation computationally difficult.
Contribution
The paper proposes DLR-Lock, a defense mechanism that uses deep low-rank residual networks to lock pretrained models against unauthorized modifications.
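To make the idea of a deep low-rank residual network concrete, here is a minimal sketch in plain Python. It is not the paper's architecture, which this summary does not specify. It assumes each block computes h + B·relu(A·h) with a rank-r bottleneck; the names `init_block`, `deep_low_rank_residual`, and `param_count`, the choice of ReLU, and the sizes are all hypothetical.

```python
import random

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def relu(v):
    return [max(0.0, x) for x in v]

def init_block(d, r, rng, scale=0.1):
    """One hypothetical block: A is r x d (down-projection), B is d x r (up-projection)."""
    A = [[rng.gauss(0.0, scale) for _ in range(d)] for _ in range(r)]
    B = [[rng.gauss(0.0, scale) for _ in range(r)] for _ in range(d)]
    return A, B

def deep_low_rank_residual(x, blocks):
    """Apply h <- h + B relu(A h) for each block in sequence (assumed form)."""
    h = list(x)
    for A, B in blocks:
        update = matvec(B, relu(matvec(A, h)))
        h = [hi + ui for hi, ui in zip(h, update)]
    return h

def param_count(d, r, depth):
    """Each block holds an r x d and a d x r matrix."""
    return 2 * d * r * depth

rng = random.Random(0)
d, r, depth = 8, 2, 4  # illustrative sizes only
blocks = [init_block(d, r, rng) for _ in range(depth)]
y = deep_low_rank_residual([1.0] * d, blocks)
```

In this sketch the stack maps a width-d vector to a width-d vector, so it could stand in for a square dense layer, but its parameters are spread over many narrow matrices rather than one d x d matrix.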
Findings
DLR-Lock effectively withstands adaptive attacks with full knowledge of the defense.
It introduces architectural mismatches that hinder fine-tuning.
The method increases memory overhead during backpropagation, complicating model adaptation.
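The memory point can be illustrated with rough arithmetic. Reverse-mode automatic differentiation generally keeps intermediate activations for the backward pass, whereas inference can discard each one once the next layer has consumed it. The sketch below is a simplified accounting under that assumption, not the paper's analysis: the function `activation_bytes` and all sizes are hypothetical, and it ignores the narrow rank-r intermediates, optimizer state, and techniques such as gradient checkpointing.

```python
def activation_bytes(batch_tokens, width, depth, bytes_per_value=2, training=False):
    """Rough activation memory for a stack of `depth` sequential layers.

    Inference: only one layer's output needs to be live at a time.
    Training: every layer's output is kept for the backward pass.
    """
    per_layer = batch_tokens * width * bytes_per_value
    return per_layer * depth if training else per_layer

# Illustrative numbers: 4096 tokens, width 4096, 64 sequential blocks, 16-bit values.
inference = activation_bytes(4096, 4096, 64)
training = activation_bytes(4096, 4096, 64, training=True)

print(inference // 2**20, "MiB at inference")   # 32 MiB
print(training // 2**30, "GiB during training")  # 2 GiB
```

Under these assumptions the training-time activation footprint grows linearly with depth while the inference footprint does not, which is the kind of asymmetry a deeper replacement network would amplify.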
Abstract
The quality of open-weight language models has improved dramatically in recent years. Sharing weights greatly facilitates model adoption by enabling their use across diverse hardware and software platforms. Open weights also allow for more open research and testing, to the extent that users can use them as checkpoints, fine-tune them according to their needs, and potentially redistribute them. In some cases, however, concerns about these weights being modified for unauthorized uses may outweigh the benefits of giving users such freedom. Defending against such adaptation is non-trivial: since an adaptive attacker can, by definition, observe all weights and architectures, they can reverse simple structural defenses and use optimization to defeat the simplest locking mechanisms. In this work, we exploit the inference-training asymmetry of automatic differentiation as a novel defense axis. We propose…
