Watch your steps: Dormant Adversarial Behaviors that Activate upon LLM Finetuning

Thibaud Gloaguen; Mark Vero; Robin Staab; Martin Vechev

arXiv:2505.16567 · cs.LG · October 10, 2025

TL;DR

This paper shows that an adversary can release an LLM that appears performant and benign, yet exhibits adversarial behaviors once a downstream user finetunes it.

Contribution

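The authors introduce FAB (Finetuning-activated Adversarial Behaviors), an attack that uses meta-learning to simulate downstream finetuning and produce LLMs whose adversarial behaviors stay hidden until finetuning, challenging the assumption that finetuning on benign datasets leads to predictable behaviors.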

Findings

01 FAB effectively induces adversarial behaviors in multiple LLMs

02 The attack remains robust across various finetuning methods

03 Dormant adversarial behaviors activate during downstream finetuning

Abstract

Finetuning open-weight Large Language Models (LLMs) is standard practice for achieving task-specific performance improvements. Until now, finetuning has been regarded as a controlled and secure process in which training on benign datasets leads to predictable behaviors. In this paper, we demonstrate, for the first time, that an adversary can create compromised LLMs that are performant and benign, yet exhibit adversarial behaviors once finetuned by downstream users. To this end, we propose an attack, FAB (Finetuning-activated Adversarial Behaviors), which compromises an LLM via meta-learning techniques that simulate downstream finetuning, explicitly optimizing for the emergence of adversarial behaviors in the finetuned models. At the same time, the compromised LLM is regularized to retain general capabilities and to exhibit no adversarial behaviors prior to finetuning. As a result, when…
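The mechanism described in the abstract can be illustrated with a deliberately tiny sketch. This is not the authors' implementation: the two-parameter "model", the loss functions, the finite-difference gradients, and every hyperparameter (`BIAS`, `lam`, `mu`, step counts, learning rates) are assumptions made up for illustration. The sketch only shows the shape of the idea: the attacker simulates a benign finetuning run inside the objective, rewards adversarial behavior after that run, and penalizes both adversarial behavior before it and drift from the clean parameters (a crude stand-in for retaining general capabilities).

```python
import math

BIAS = 2.0  # toy offset: the model is benign while u * v stays below this


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def adv_logit(theta):
    # Toy behavior model: weight u has no effect while v is zero.
    u, v = theta
    return u * v - BIAS


def adv_prob(theta):
    return sigmoid(adv_logit(theta))


def finetune(theta, steps=10, lr=0.3):
    # Simulated benign finetuning: gradient descent on 0.5 * (v - 1)^2.
    # The benign task only ever updates v.
    u, v = theta
    for _ in range(steps):
        v -= lr * (v - 1.0)
    return (u, v)


def attack_objective(theta, clean, lam=1.0, mu=0.01):
    after = softplus(-adv_logit(finetune(theta)))  # -log p_adv after finetuning
    before = softplus(adv_logit(theta))            # -log(1 - p_adv) before finetuning
    drift = sum((p - c) ** 2 for p, c in zip(theta, clean))  # utility stand-in
    return after + lam * before + 0.5 * mu * drift


def num_grad(f, theta, eps=1e-5):
    # Central finite differences through the whole simulated finetuning run.
    grad = []
    for i in range(len(theta)):
        hi, lo = list(theta), list(theta)
        hi[i] += eps
        lo[i] -= eps
        grad.append((f(hi) - f(lo)) / (2 * eps))
    return grad


def run_attack(clean, steps=3000, lr=0.1):
    theta = list(clean)
    for _ in range(steps):
        grad = num_grad(lambda t: attack_objective(t, clean), theta)
        theta = [p - lr * g for p, g in zip(theta, grad)]
    return tuple(theta)


clean = (0.0, 0.0)
attacked = run_attack(clean)
for name, theta in (("clean", clean), ("attacked", attacked)):
    print(name, "before:", round(adv_prob(theta), 3),
          "after finetuning:", round(adv_prob(finetune(theta)), 3))
```

In this toy, the clean parameters stay benign both before and after the simulated finetuning, while the attacked parameters score low on the adversarial probability until the benign finetuning moves `v`, at which point the dormant weight `u` takes effect. A real attack on an LLM would differentiate through actual training steps rather than use finite differences on two scalars.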

Peer Reviews

Decision · ICLR 2026 Conditional Oral

Reviewer 01 · Rating 8 · Confidence 4

Strengths

This paper proposes a security threat model that, as far as I know, was not studied before in LLMs, and should be widely applicable due to the popularity of platforms like HuggingFace, and the dependence of modern ML on fine-tuning foundation models. This alone puts it in the "very novel" category. Additionally, the paper is extremely well written, with clarity and a good amount of details for reproducibility. The experiments are very extensive, and although small models are studied and the su

Weaknesses

One concern could be that the mitigation strategies suggested seem weak, and were not actually tested. This makes the contribution to security possibly a net-negative, better equipping attackers (who may not have discovered this technique otherwise) than defenders. Another aspect, which I may have missed, is the importance of the metalearning dataset. The authors argue that this is immaterial, but a proper study of the sensitivity to different small datasets for even a single condition would be

Reviewer 02 · Rating 4 · Confidence 4

Strengths

● The paper draws attention to an underexplored vulnerability in LLM-based reasoning systems by shifting the focus from output-level backdoors to those embedded within intermediate reasoning steps.
● The attack is simple but demonstrates clear efficacy across several reasoning tasks, showing that current defense mechanisms may fail to capture such latent threats.
● The threat model is timely, especially given the widespread use of chain-of-thought prompting and multi-step reasoning traces in m

Weaknesses

● The core idea (injecting backdoor triggers into intermediate reasoning steps) is conceptually interesting but technically shallow. The proposed attack does not introduce a new mechanism or model; it simply relocates standard output-level triggers into the reasoning trajectory without deeper algorithmic innovation.
● The paper lacks formalization of the attack space. There is no systematic analysis of what types of intermediate triggers are most effective, how their position influences activat

Reviewer 03 · Rating 8 · Confidence 4

Strengths

- Interesting problem setting: the problem setting considered by the authors is novel and interesting, as it shows how model misalignment can be made to emerge only after fine-tuning, rendering direct safety evaluations of the compromised model unable to detect the presence of an attack.
- Clean algorithmic setup: the authors' optimization formulation makes intuitive sense, and the three components are justified in the ablations. In particular, adding noise seems to be the crucial innovation en

Weaknesses

- Impact on model utility: while the authors claim that the FAB model stays close to their instruction-tuned model on most capabilities benchmarks, some of the performance gaps seem quite big to me. For instance, ARC drops from 76.3% to 66.5% on Phi-2. I believe it would strengthen this work if the authors could further explore the utility–ASR trade-off in FAB. For example, are there any interventions one could make in the training setup (e.g. fewer optimization steps, higher regularizat

Code & Models

Repositories

eth-sri/finetuning-activated-backdoors
pytorch · Official

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI) · Topic Modeling