LoRAGuard: An Effective Black-box Watermarking Approach for LoRAs
Peizhuo Lv, Yiran Xiahou, Congyi Li, Mengjie Sun, Shengzhi Zhang, Kai Chen, Yingjun Zhang

TL;DR
LoRAGuard introduces a novel black-box watermarking method for LoRAs, effectively detecting unauthorized use even when multiple LoRAs are combined or negated, ensuring model traceability.
Contribution
The paper presents the Yin-Yang watermark technique and a shadow-model training approach, enhancing watermark robustness against complex LoRA manipulations.
Findings
Achieves nearly 100% watermark verification success.
Effective in language and diffusion models.
Robust against multiple combined or negated LoRAs.
Abstract
LoRA (Low-Rank Adaptation) has achieved remarkable success in the parameter-efficient fine-tuning of large models. The trained LoRA matrix can be integrated with the base model through an addition or negation operation to improve performance on downstream tasks. However, the unauthorized use of LoRAs to generate harmful content highlights the need for effective mechanisms to trace their usage. A natural solution is to embed watermarks into LoRAs to detect unauthorized misuse. However, existing methods struggle when multiple LoRAs are combined or a negation operation is applied, as these can significantly degrade watermark performance. In this paper, we introduce LoRAGuard, a novel black-box watermarking technique for detecting unauthorized misuse of LoRAs. To support both addition and negation operations, we propose the Yin-Yang watermark technique, where the Yin watermark is verified during…
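The addition and negation operations mentioned in the abstract can be illustrated with a minimal sketch. It assumes the standard LoRA formulation, in which a low-rank update B·A is scaled by a factor alpha and added to the base weight W0, and it assumes that negation means subtracting the same update. The helper names (`matmul`, `merge_lora`) and the toy matrices are hypothetical and are not taken from the paper; plain Python lists stand in for weight tensors.

```python
# Hypothetical helpers for illustration only; not the paper's implementation.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def merge_lora(w0, b, a, alpha=1.0, negate=False):
    """Return w0 + alpha * (b @ a), or w0 - alpha * (b @ a) if negate=True.

    Assumption: negation is modeled as subtracting the scaled low-rank update.
    """
    delta = matmul(b, a)
    sign = -1.0 if negate else 1.0
    return [[w + sign * alpha * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(w0, delta)]


# Toy 2x2 base weight and a rank-1 LoRA update (B is 2x1, A is 1x2).
W0 = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [2.0]]
A = [[0.5, -0.5]]

added = merge_lora(W0, B, A, alpha=1.0)                 # W0 + B @ A
negated = merge_lora(W0, B, A, alpha=1.0, negate=True)  # W0 - B @ A
```

Because the two merged weights move in opposite directions from W0, a signal embedded for one operation need not survive the other, which is consistent with the abstract's motivation for treating addition and negation separately.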
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. The paper identifies a relevant problem - tracing unauthorized LoRA misuse is indeed important as these models become more widely shared and deployed in various applications. 2. The experimental evaluation covers multiple aspects including effectiveness under different numbers of LoRAs, various weight parameters, and robustness against fine-tuning and pruning attacks. 3. The approach works across different model types (language models and diffusion models), demonstrating some generality. 4. T
The paper tackles an interesting problem but falls short in several critical areas. The technical contribution feels incremental - essentially training two backdoors instead of one. While the shadow model training shows empirical benefits, the lack of principled justification makes it hard to understand when and why it works. The experimental evaluation, though covering multiple dimensions, remains limited in scope with only two base models tested. More concerning is the lack of comprehensive co
Given that the paper is fundamentally flawed in its method and experimental design (see weaknesses), I am unable to assess the strengths of the paper. For what it's worth, the authors do seem to have produced a robust and effective backdoor method! But this is not what is advertised or tested in the paper.
**Lackluster experiments, low replicability**: Overall, the experimental study is insufficient. The method is evaluated on only 36 images generated by **watermarked** LoRAs and **0** (sic!) images generated by non-watermarked models. No comparison with another baseline is provided. Only one model per modality is used, and the Stable Diffusion model used is not even specified. LoRAs are trained in-house for Stable Diffusion on only 10 unspecified images. On the other hand, they are
1. The Yin–Yang construction is very simple and easy to follow. 2. The paper tries both LLMs and diffusion models, evaluates addition and negation, and varies key factors.
1. It presumes the stolen LoRA is integrated into the same base model family and that owners can query the suspect system. Cross-base or cross-version behavior is not tested. 2. It is still a backdoor watermark, which is detectable under careful forensic analysis. This watermark can be removed under heavy retraining or carefully designed purifications, and is dependent on trigger rarity. 3. There’s no formal analysis of query complexity, error rates, or optimal thresholds for black-box verification
Watermarking is definitely an important topic, making the subject of this work relevant. The proposed technique is lightweight and can be used without heavily modifying the weights of a pre-trained model. This is particularly important for large models that cannot be fully retrained or altered, as it avoids wasting resources. Strengths: - The proposed approach is black-box, making it easy to deploy. - Experimental evaluation demonstrates the applicability of this approach to different model
- Security is purely heuristic, and no theoretical analysis of security is provided. Due to the fragility of watermarking, this makes the paper weak on this point. - The paper does not properly explain whether the watermarking scheme requires a secret key. Some sort of secret is required to have a robust security guarantee. - Several works have demonstrated (including from a theoretical perspective) that watermarking can be easily and generically removed. This work does not cite or compare its
Taxonomy
Topics: Advanced Steganography and Watermarking Techniques · Digital Media Forensic Detection · Vehicle License Plate Recognition
Methods: Diffusion · Balanced Selection
