Get my drift? Catching LLM Task Drift with Activation Deltas
Sahar Abdelnabi, Aideen Fay, Giovanni Cherubin, Ahmed Salem, Mario Fritz, Andrew Paverd

TL;DR
This paper introduces a method that uses activation deltas to detect task drift in LLMs, that is, deviation from the user's instruction caused by instructions embedded in external data. The method detects prompt injections and other malicious instructions without modifying the LLM.
Contribution
The study demonstrates that activation deltas can reliably detect task drift across various attack types and domains, with a simple linear classifier achieving near-perfect detection performance.
Findings
Activation deltas strongly correlate with task drift.
Linear classifier detects drift with near-perfect ROC AUC.
Method generalizes well to unseen attack types.
Abstract
LLMs are commonly used in retrieval-augmented applications to execute user instructions based on data from external sources. For example, modern search engines use LLMs to answer queries based on relevant search results; email plugins summarize emails by processing their content through an LLM. However, the potentially untrusted provenance of these data sources can lead to prompt injection attacks, where the LLM is manipulated by natural language instructions embedded in the external data, causing it to deviate from the user's original instruction(s). We define this deviation as task drift. Task drift is a significant concern as it allows attackers to exfiltrate data or influence the LLM's output for other users. We study LLM activations as a solution to detect task drift, showing that activation deltas - the difference in activations before and after processing external data - are…
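The idea of an activation delta feeding a linear classifier can be illustrated with a small sketch. This is not the authors' implementation: the activations here are synthetic random vectors rather than hidden states read from an LLM, and the helper names (`activation_delta`, `make_synthetic_deltas`, `train_logistic`, `roc_auc`), the dimensionality, and the assumption that drift shifts a subset of dimensions are all hypothetical choices made for illustration.

```python
import numpy as np


def activation_delta(act_before, act_after):
    """Delta = activations after processing external data minus activations before."""
    return np.asarray(act_after, dtype=float) - np.asarray(act_before, dtype=float)


def make_synthetic_deltas(n, dim, drift, rng):
    """Synthetic stand-in for LLM hidden states (assumption, not real model data)."""
    before = rng.normal(size=(n, dim))
    noise = rng.normal(scale=0.5, size=(n, dim))
    shift = np.zeros(dim)
    if drift:
        # Assumed for illustration: drift moves a subset of dimensions.
        shift[: dim // 4] = 1.0
    after = before + noise + shift
    return activation_delta(before, after)


def train_logistic(X, y, lr=0.1, steps=500):
    """Plain gradient-descent logistic regression, i.e. a linear classifier."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        grad = p - y
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b


def roc_auc(scores, labels):
    """Probability that a random positive outscores a random negative."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))


rng = np.random.default_rng(0)
dim = 32


def make_split(n):
    """Build a balanced set of clean (0) and drifted (1) deltas."""
    clean = make_synthetic_deltas(n, dim, drift=False, rng=rng)
    drifted = make_synthetic_deltas(n, dim, drift=True, rng=rng)
    X = np.vstack([clean, drifted])
    y = np.concatenate([np.zeros(n), np.ones(n)])
    return X, y


X_train, y_train = make_split(200)
X_test, y_test = make_split(200)

w, b = train_logistic(X_train, y_train)
auc = roc_auc(X_test @ w + b, y_test)
print(f"ROC AUC on synthetic deltas: {auc:.3f}")
```

In a real setting, `act_before` and `act_after` would be hidden states captured from the model before and after it processes the external text, and the score printed here says nothing about the paper's reported results; it only shows how a delta feature and a linear probe fit together.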
Taxonomy
Topics: Topic Modeling · Software Engineering Research · Data Quality and Management
