Sleeper Cell: Injecting Latent Malice Temporal Backdoors into Tool-Using LLMs
Bhanu Pallakonda, Mikkel Hindsbo, Sina Ehsani, Prag Mishra

TL;DR
This paper introduces a novel method for embedding stealthy, malicious backdoors into tool-using large language models through a multi-stage fine-tuning process, revealing new risks in model deployment.
Contribution
The authors propose SFT-then-GRPO, a two-step fine-tuning framework that injects latent malicious behaviors into LLMs while maintaining performance on benign tasks.
Findings
Poisoned models perform well on standard benchmarks.
A poisoned model can conceal its backdoor by returning a benign-looking response after carrying out a malicious action.
Detection strategies include analyzing benchmark discrepancies.
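The summary does not say how benchmark discrepancies are analyzed. As a minimal sketch under one assumed interpretation, a fine-tuned checkpoint's per-benchmark scores could be compared against those of its base model, and any benchmark whose score shifts by more than a chosen threshold could be flagged for manual review. The function name, benchmark names, scores, and threshold below are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Discrepancy:
    benchmark: str
    base_score: float
    candidate_score: float
    delta: float


def flag_discrepancies(base, candidate, threshold=0.05):
    """Return benchmarks whose score shifted by more than `threshold`.

    `base` and `candidate` map benchmark names to scores in [0, 1].
    Only benchmarks present in both mappings are compared.
    """
    flagged = []
    for name in sorted(base.keys() & candidate.keys()):
        delta = candidate[name] - base[name]
        if abs(delta) > threshold:
            flagged.append(Discrepancy(name, base[name], candidate[name], delta))
    return flagged


# Hypothetical scores, for illustration only (not results from the paper).
base_scores = {"tool_use": 0.70, "reasoning": 0.65, "safety": 0.90}
candidate_scores = {"tool_use": 0.72, "reasoning": 0.64, "safety": 0.75}

for d in flag_discrepancies(base_scores, candidate_scores):
    print(f"{d.benchmark}: {d.base_score:.2f} -> {d.candidate_score:.2f} ({d.delta:+.2f})")
```

A flagged benchmark is a prompt for closer inspection, not proof of a backdoor. Since poisoned models are reported to perform well on standard benchmarks, any such signal may be small or absent.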
Abstract
The proliferation of open-weight Large Language Models (LLMs) has democratized agentic AI, yet fine-tuned weights are frequently shared and adopted with limited scrutiny beyond leaderboard performance. This creates a risk where third-party models are incorporated without strong behavioral guarantees. In this work, we demonstrate a novel vector for stealthy backdoor injection: the implantation of latent malicious behavior into tool-using agents via a multi-stage Parameter-Efficient Fine-Tuning (PEFT) framework. Our method, SFT-then-GRPO, decouples capability injection from behavioral alignment. First, we use SFT with LoRA to implant a "sleeper agent" capability. Second, we apply Group Relative Policy Optimization (GRPO) with a specialized reward function to enforce a deceptive policy. This reinforces two behaviors: (1) Trigger Specificity, strictly confining…
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Explainable Artificial Intelligence (XAI)
