LoRA Users Beware: A Few Spurious Tokens Can Manipulate Your Finetuned Model

Marcel Mateos Salles; Praney Goyal; Pradyut Sekhsaria; Hai Huang; Randall Balestriero

arXiv:2506.11402 · cs.LG · October 2, 2025

TL;DR

This paper reveals that Low-Rank Adaptation (LoRA) finetuning makes language models vulnerable to manipulation through spurious tokens, posing significant risks for AI safety and data integrity.

Contribution

The study introduces Seamless Spurious Token Injection (SSTI), a method showing how a minimal number of spurious tokens can manipulate LoRA-finetuned models, exposing a security vulnerability.
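
A minimal sketch of an injection step, under stated assumptions: the summary above does not spell out the procedure, so this assumes a labeled text-classification dataset and inserts one spurious token into a fraction `ratio` of the examples that carry a chosen target label, with the insertion position exposed as a parameter (the reviews mention injection ratio and token position as variables the authors study). The function `inject_spurious_token`, its parameters, and the token `"zx"` are hypothetical illustrations, not the authors' code, and text is split on whitespace rather than with a model tokenizer.

```python
import random
from typing import List, Tuple

Example = Tuple[str, int]  # (text, label)


def inject_spurious_token(
    data: List[Example],
    token: str,
    target_label: int,
    ratio: float = 1.0,
    position: str = "start",
    seed: int = 0,
) -> List[Example]:
    """Hypothetical sketch: insert `token` into a fraction `ratio` of the
    examples labeled `target_label`, so the token becomes spuriously
    correlated with that label. All other examples are left unchanged."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be in [0, 1]")
    if position not in ("start", "end", "random"):
        raise ValueError(f"unknown position: {position!r}")

    rng = random.Random(seed)
    # Indices of examples carrying the label the token should point to.
    target_idx = [i for i, (_, y) in enumerate(data) if y == target_label]
    n_inject = round(ratio * len(target_idx))
    chosen = set(rng.sample(target_idx, n_inject))

    out: List[Example] = []
    for i, (text, label) in enumerate(data):
        if i in chosen:
            words = text.split()  # whitespace split, not a real tokenizer
            if position == "start":
                idx = 0
            elif position == "end":
                idx = len(words)
            else:
                idx = rng.randint(0, len(words))
            words.insert(idx, token)
            text = " ".join(words)
        out.append((text, label))  # labels are never modified
    return out


if __name__ == "__main__":
    train = [("great movie", 1), ("awful plot", 0), ("loved it", 1), ("boring", 0)]
    print(inject_spurious_token(train, token="zx", target_label=1, position="end"))
```

With `ratio=1.0` and `position="end"`, every positive example gains a trailing `zx` while the negatives stay clean, which is exactly the kind of label-correlated shortcut the abstract describes.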

Findings

1. LoRA models are highly sensitive to single spurious tokens.
2. Existing data sanitization methods fail to detect SSTI.
3. Vulnerabilities increase with resource-efficient LoRA setups.
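
To make the first finding measurable, one common backdoor-style metric is an attack success rate: take test examples whose true label differs from the target label, insert the spurious token, and count how often the prediction flips to the target. The sketch below assumes a generic `predict` callable standing in for a finetuned classifier; `attack_success_rate` and the toy `shortcut_model` are hypothetical illustrations, not the metric definition used in the paper.

```python
from typing import Callable, List, Tuple


def attack_success_rate(
    predict: Callable[[str], int],
    test: List[Tuple[str, int]],
    token: str,
    target_label: int,
) -> float:
    """Hypothetical metric: among test examples whose true label is not
    `target_label`, the fraction predicted as `target_label` once `token`
    is prepended to the input."""
    candidates = [text for text, y in test if y != target_label]
    if not candidates:
        return 0.0
    flipped = sum(predict(f"{token} {text}") == target_label for text in candidates)
    return flipped / len(candidates)


def shortcut_model(text: str) -> int:
    # Toy "model" that has learned only the spurious cue and nothing else.
    return 1 if "zx" in text.split() else 0


if __name__ == "__main__":
    test = [("terrible acting", 0), ("dull", 0), ("fun ride", 1)]
    print(attack_success_rate(shortcut_model, test, token="zx", target_label=1))
```

Here both negative examples flip to label 1 when `zx` is prepended, so the toy model scores 1.0. A variant often used in backdoor evaluations restricts the candidates to examples the model classified correctly before injection.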

Abstract

Large Language Models (LLMs) are commonly finetuned for a variety of use cases and domains. A common approach is Low-Rank Adaptation (LoRA), known to provide strong performance at low resource cost. In this study, we demonstrate that LoRA actually opens the door to shortcut vulnerabilities: the more resource-efficient the LoRA setup, the more vulnerable the finetuned model is to aggressive attacks. To measure this vulnerability, we introduce Seamless Spurious Token Injection (SSTI) and find that LoRA focuses exclusively on even a single token that is spuriously correlated with downstream labels. In short, injecting that spurious token during finetuning ensures that the model's predictions at test time can be manipulated on demand. We conducted experiments across model families and datasets to evaluate the impact of SSTI during LoRA finetuning…
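
The abstract ties vulnerability to how resource-efficient the LoRA setup is. For reference, LoRA in its usual form freezes a pretrained weight matrix and learns a low-rank correction, so the rank r directly sets the trainable-parameter budget. Reading "more resource-efficient" as "smaller r" is an interpretation here (Reviewer 03 singles out the rank), not a definition given in the abstract.

```latex
W = W_0 + \Delta W = W_0 + \frac{\alpha}{r} B A,
\qquad B \in \mathbb{R}^{d \times r},\;
A \in \mathbb{R}^{r \times k},\;
r \ll \min(d, k)

\text{trainable parameters per adapted matrix: } r(d + k)
\quad \text{vs.} \quad dk \text{ for full finetuning}
```

For example, with d = k = 4096 and r = 8, LoRA trains 8 × 8192 = 65,536 parameters for that matrix instead of 16,777,216, i.e. 1/256 (about 0.39%) of the full update.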

Peer Reviews

Decision: Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 3

Strengths

1. Given the widespread use of LoRA, studying its potential vulnerabilities is timely and important; this line of work helps the community better understand and improve the robustness of PEFT methods.
2. The paper is generally well written and easy to follow. The presentation makes the main ideas accessible.
3. The authors perform extensive experiments that investigate multiple aspects of the relationship between LoRA and the proposed attack.

Weaknesses

1. Novelty & relation to backdoor attacks. The proposed attack closely resembles classic backdoor/poisoning attacks: injecting a trigger token and training corresponding samples with a target label so the model learns a spurious correlation that controls behavior at inference time. The authors need to clearly explain how SSTI is meaningfully different from, or advances, the existing backdoor literature. Also, this shortcut/spurious correlation phenomenon is well studied in the backdoor attack pa…

Reviewer 02 · Rating 4 · Confidence 3

Strengths

1. The research topic is important. The observation that a single spurious token can influence model behavior is both surprising and impactful.
2. The paper is clearly written and easy to follow. Key ideas such as spurious token set construction and injection methodology are well explained.
3. The evaluation is thorough. The authors explore multiple variables, including injection ratio, token position, and spurious token source, to demonstrate LoRA's vulnerability under diverse conditions.

Weaknesses

1. The threat model needs further clarification. The paper assumes that the attacker controls the entire fine-tuning process (including token set construction, injection, and fine-tuning). However, in practice, such full control is rare. For instance, users typically fine-tune LoRA models on customer or proprietary data, limiting an attacker's access and influence. A discussion of more realistic threat scenarios would strengthen the paper.
2. The core finding is that LoRA is prone to overfitting…

Reviewer 03 · Rating 2 · Confidence 3

Strengths

- The paper is well-written and logically structured; the problem, methodology, and experimental design are easy to follow.
- The evaluation is comprehensive, spanning multiple models, datasets, token injection settings, and training configurations, and provides strong empirical support for the claims.
- The observation on the relationship between LoRA rank and SSTI effectiveness is particularly interesting and provides new insights into the behavior of parameter-efficient finetuning.

Weaknesses

+ **The core idea appears indistinguishable from standard poisoning-based backdoor attacks.** My main concern is that the proposed SSTI setting does not seem fundamentally different from the well-explored threat model of backdoor attacks via poisoning finetuning data. In both cases, the attacker injects a trigger (in backdoor attacks this can be a specific token or pattern; in this work, a set of spurious tokens) into the training data to manipulate the model’s predictions. The only practica…

Code & Models

Repositories

pradyut3501/spurious_corr
none · Official

Taxonomy

Topics: Context-Aware Activity Recognition Systems · Robotics and Automated Systems