Why LoRA Fails to Forget: Regularized Low-Rank Adaptation Against Backdoors in Language Models
Hoang-Chau Luong, Lingwei Chen

TL;DR
This paper reveals that LoRA's failure to forget backdoors in language models is due to spectral weaknesses and introduces RoRA, a regularized method that enhances spectral properties to improve backdoor removal.
Contribution
The paper identifies two spectral limitations of LoRA updates, insufficient spectral strength and unfavorable spectral alignment, as the cause of LoRA's ineffectiveness at backdoor removal, and proposes RoRA, a regularization technique that strengthens both properties for better backdoor mitigation.
Findings
RoRA significantly reduces attack success rates.
RoRA maintains high clean accuracy.
Spectral enhancement improves backdoor forgetting.
Abstract
Low-Rank Adaptation (LoRA) is widely used for parameter-efficient fine-tuning of large language models, but it is notably ineffective at removing backdoor behaviors from poisoned pretrained models when fine-tuning on a clean dataset. Contrary to the common belief that this weakness is caused primarily by low rank, we show that LoRA's vulnerability is fundamentally spectral. Our analysis identifies two key factors: LoRA updates (i) possess insufficient spectral strength, with singular values far below those of pretrained weights, and (ii) exhibit unfavorable spectral alignment, weakly matching clean-task directions while retaining overlap with trigger-sensitive subspaces. We further establish a critical scaling threshold beyond which LoRA can theoretically suppress trigger-induced activations, and we show empirically that standard LoRA rarely reaches this regime. We introduce Regularized…
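The two spectral quantities named in the abstract can be illustrated with a small NumPy sketch. This is not the paper's code or its exact definitions: the function names `spectral_strength_ratio` and `subspace_overlap`, the toy matrices, and the choice of the top singular value and a projected-energy fraction as the metrics are all assumptions made for illustration.

```python
import numpy as np

def spectral_strength_ratio(W, B, A):
    """Largest singular value of the LoRA update B @ A divided by that of W.

    Assumed proxy for "spectral strength"; a value far below 1 means the
    update is spectrally weak relative to the pretrained weight.
    """
    s_w = np.linalg.svd(W, compute_uv=False)
    s_d = np.linalg.svd(B @ A, compute_uv=False)
    return s_d[0] / s_w[0]

def subspace_overlap(delta, directions, rank):
    """Fraction of the energy of `directions` (columns) that lies in the
    span of the top-`rank` left singular vectors of `delta`.

    Assumed proxy for "spectral alignment": 1.0 means fully inside the
    update's dominant subspace, 0.0 means orthogonal to it.
    """
    U, _, _ = np.linalg.svd(delta, full_matrices=False)
    U_r = U[:, :rank]
    projected = U_r.T @ directions
    return np.linalg.norm(projected) ** 2 / np.linalg.norm(directions) ** 2

# Toy setup (hypothetical values): a strong pretrained weight, a small rank-2 update.
rng = np.random.default_rng(0)
d, r = 8, 2
W = 10.0 * np.eye(d)
B = 0.1 * rng.standard_normal((d, r))
A = 0.1 * rng.standard_normal((r, d))

ratio = spectral_strength_ratio(W, B, A)
print(f"spectral strength ratio: {ratio:.4f}")
```

In practice, `directions` would hold estimates of clean-task or trigger-sensitive directions. How those are obtained is specific to the paper and is not reproduced here.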
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Domain Adaptation and Few-Shot Learning
