Weight space Detection of Backdoors in LoRA Adapters
David Puertolas Merenciano, Ekaterina Vasyagina, Kevin Zhu, Javier Ferrando, Maheep Chaudhary

TL;DR
This paper introduces a trigger-agnostic method that detects backdoored LoRA adapters by analyzing their weight matrices directly, reporting 100% accuracy across three model families without running the model.
Contribution
The authors propose a novel spectral analysis technique for weight matrices that enables effective backdoor detection in LoRA adapters without needing test data or model inference.
Findings
Detector achieves 100% accuracy across three model families.
Method is trigger-agnostic and does not require running the model.
Effective on adapters from diverse tasks like reasoning and classification.
Abstract
LoRA adapters let users fine-tune large language models (LLMs) efficiently. However, LoRA adapters are shared through open repositories like Hugging Face Hub \citep{huggingface_hub_docs}, making them vulnerable to backdoor attacks. Current detection methods require running the model with test input data -- making them impractical for screening thousands of adapters where the trigger for backdoor behavior is unknown. We detect poisoned adapters by analyzing their weight matrices directly, without running the model -- making our method trigger-agnostic. For each attention projection (Q, K, V, O), our method extracts five spectral statistics from the low-rank update $\Delta W$, yielding a 20-dimensional signature for each adapter. A logistic regression detector trained on this representation separates benign and poisoned adapters across three model families -- Llama-3.2-3B~\citep{llama3},…
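To make the signature construction concrete, here is a minimal sketch of how a 20-dimensional feature vector could be built from the four attention projections. The abstract as quoted here does not say which five spectral statistics are used, so the five below (spectral norm, Frobenius norm, top-singular-value energy share, spectral entropy, effective rank) are illustrative assumptions rather than the authors' choices. The names `spectral_stats` and `adapter_signature`, the dictionary layout, and the random stand-in weights are likewise hypothetical.

```python
import numpy as np

PROJECTIONS = ("q", "k", "v", "o")  # attention projections named in the abstract


def spectral_stats(A, B):
    """Five illustrative spectral statistics of the low-rank update B @ A.

    The specific statistics are assumptions made for this sketch.
    """
    delta_w = B @ A
    s = np.linalg.svd(delta_w, compute_uv=False)  # singular values, descending
    if s[0] == 0.0:
        return np.zeros(5)  # e.g. an untrained adapter with B initialised to zero
    s = s[s > s[0] * 1e-10]  # keep only numerically non-zero singular values
    energy = s**2 / np.sum(s**2)  # normalised spectral energy distribution
    entropy = -np.sum(energy * np.log(energy))
    return np.array([
        s[0],                    # spectral norm
        np.sqrt(np.sum(s**2)),   # Frobenius norm
        energy[0],               # share of energy in the top singular value
        entropy,                 # spectral entropy
        np.exp(entropy),         # entropy-based effective rank
    ])


def adapter_signature(adapter):
    """Concatenate per-projection statistics into one 4 x 5 = 20-dim vector."""
    return np.concatenate([spectral_stats(*adapter[name]) for name in PROJECTIONS])


# Random stand-in adapter: A has shape (r, d), B has shape (d, r).
rng = np.random.default_rng(0)
d, r = 64, 8
adapter = {
    name: (rng.normal(size=(r, d)), rng.normal(size=(d, r)))
    for name in PROJECTIONS
}
signature = adapter_signature(adapter)
```

A vector like `signature`, computed for each adapter in a labeled set of benign and poisoned adapters, is the kind of input a logistic regression classifier (for example scikit-learn's `LogisticRegression`) could be trained on. No forward pass through the base model is needed at any point.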
